Problem Statement¶
Business Context¶
Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.
Among renewable energy alternatives, wind energy is one of the most mature technologies worldwide. The U.S. Department of Energy has published a guide to achieving operational efficiency through predictive maintenance practices.
Predictive maintenance uses sensor data and analytical methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, operation and maintenance costs will be much lower.
Sensors fitted across the machines involved in energy generation collect data on environmental factors (temperature, humidity, wind speed, etc.) and on various parts of the wind turbine (gearbox, tower, blades, brake, etc.).
Objective¶
“ReneWind” is a company that uses machine learning to improve the machinery and processes involved in wind energy production, and it has collected sensor data on generator failures in wind turbines. Because the data collected through sensors is confidential (the type of data collected varies by company), the company has shared a ciphered version. The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.
The objective is to build and tune various classification models and find the best one for identifying failures, so that generators can be repaired before they fail or break, reducing the overall maintenance cost. The predictions made by a classification model translate as follows:
- True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
- False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
- False positives (FP) are detections where there is no failure. These will result in inspection costs.
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variable represents “failure” and “0” represents “no failure”.
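The cost structure above can be sketched as a simple cost function over confusion-matrix counts. The unit costs below are hypothetical placeholders: the problem only states that inspection < repair < replacement, not the actual amounts.

```python
# Hypothetical unit costs -- only their ordering (inspection < repair <
# replacement) is given by the problem statement.
INSPECTION_COST = 100    # assumed cost per false positive
REPAIR_COST = 400        # assumed cost per true positive
REPLACEMENT_COST = 4000  # assumed cost per false negative

def maintenance_cost(tp, fp, fn):
    """Total maintenance cost implied by a model's confusion-matrix counts."""
    return tp * REPAIR_COST + fp * INSPECTION_COST + fn * REPLACEMENT_COST

# A model that misses failures (high FN) is far more expensive than one
# that over-predicts them (high FP), which motivates optimizing recall.
print(maintenance_cost(tp=50, fp=20, fn=5))   # few missed failures
print(maintenance_cost(tp=30, fp=5, fn=25))   # many missed failures
```

Under these assumed costs, the model that catches more failures is cheaper overall even though it triggers more inspections.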
Data Description¶
The data provided is a transformed version of the original data which was collected using sensors.
- Train.csv - To be used for training and tuning of models.
- Test.csv - To be used only for testing the performance of the final best model.
Both the datasets consist of 40 predictor variables and 1 target variable.
Installing and Importing the necessary libraries¶
# Installing the libraries with the specified version
# !pip install --no-deps tensorflow==2.18.0 scikit-learn==1.3.2 matplotlib===3.8.3 seaborn==0.13.2 numpy==1.26.4 pandas==2.2.2 -q --user --no-warn-script-location
!pip install tensorflow scikit-learn matplotlib seaborn numpy pandas --user
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
import time
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to split the data into train and test
from sklearn.model_selection import train_test_split
# Tools for data preprocessing: label encoding, one-hot encoding, and standard scaling
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
#Imports a class for imputing missing values in datasets.
from sklearn.impute import SimpleImputer
import tensorflow as tf #An end-to-end open source machine learning platform
from tensorflow import keras # High-level neural networks API for deep learning.
from tensorflow.keras import backend  # Abstraction layer for neural network backend engines.
from tensorflow.keras.models import Sequential  # Model for building a NN layer by layer.
from tensorflow.keras.layers import (  # Layers for building the NN.
    Dense,
    Dropout,
    Activation,
    BatchNormalization,
)
# Libraries to get different metric scores
from sklearn import metrics
from sklearn.utils import class_weight
from sklearn.metrics import (
confusion_matrix,
ConfusionMatrixDisplay,
accuracy_score,
precision_score,
recall_score,
f1_score,
classification_report,
)
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
Global Data Values¶
# Random state for reproducibility
RS = 42
# Validation size for train-validation split
VS = 0.25
# Scoring function for model evaluation. Recall is chosen as the scoring metric because
# false negatives (missed failures) incur the highest cost (replacement), so minimizing them matters most.
SCORER = metrics.make_scorer(metrics.recall_score)
# Set class weights as imbalanced data is used
CLASS_WEIGHTS = {0:1.0, 1: 10.0}
EPOCHS = 30
BATCH_SIZE = 16
LEARNING_RATE=0.001
THRESHOLD = 0.5
Common Methods¶
- To reduce code duplication, I'm predefining methods that will be used frequently throughout the EDA and data modeling steps. The methods cover:
- Confusion Matrix Rendering
- Correlation Matrix Rendering
- Box Plot and Histogram Rendering
- Scatter Plot and Count Plot Rendering
# Function to plot the confusion matrix
# Parameters:
# model: The trained model to evaluate.
# x: Features used for prediction.
# y: True labels for the features.
# title: Optional title for the confusion matrix plot.
def draw_confusion_matrix(model, x, y, title=None):
    # Predict probabilities and convert to binary predictions based on the threshold
    y_pred = model.predict(x) > THRESHOLD
    cm = confusion_matrix(y, y_pred, labels=[0, 1])
    # Create labels for each cell with both count and percentage of all samples
    labels = np.asarray(
        [f"{item}\n{item / cm.sum():.2%}" for item in cm.flatten()]
    ).reshape(cm.shape)
    # Create the confusion matrix display with the class labels 'No' and 'Yes'
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['No', 'Yes'])
    disp.plot(include_values=False)  # Prevent default annotation
    ax = disp.ax_
    # Set the title of the confusion matrix plot if provided
    if title is not None:
        ax.set_title(title)
    # Annotate each cell with its count and percentage
    for (i, j), label in np.ndenumerate(labels):
        ax.text(j, i, label, ha='center', va='center', color='black', fontsize=12)
    plt.grid(False)  # Turn off the grid
def highlight_strong_correlations(val):
    color = ''
    if THRESHOLD <= abs(val) < 1:  # Exclude self-correlation of 1
        color = 'background-color: green'
    return color
def draw_default_correlation_matrix(data):
    """
    Plot the correlation matrix for the given DataFrame.

    Parameters:
        data (DataFrame): The DataFrame containing the data.
    """
    cols_list = data.select_dtypes(include=np.number).columns.tolist()
    # Calculate the correlation matrix
    correlation_matrix = data[cols_list].corr()
    # Styler.map replaces the deprecated Styler.applymap in recent pandas
    styled_corr = correlation_matrix.style.map(highlight_strong_correlations)
    display(styled_corr)
    # Plot the correlation matrix using a heatmap
    plt.figure(figsize=(64, 32))
    sns.heatmap(
        correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
        vmin=-1, vmax=1, square=True, cbar_kws={"shrink": .8},
    )
    plt.title('Correlation Matrix')
    plt.show()
# Method to draw a box plot and histogram for univariate analysis
# Parameters:
#   data: The DataFrame containing the data.
#   column_name: The name of the column for which the box plot and histogram will be drawn.
def draw_boxplot_and_histogram(data, column_name):
    plt.figure(figsize=(18, 6))
    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(data[column_name], bins=30, kde=True)
    plt.title(f'{column_name} Distribution')
    # Box plot
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[column_name])
    plt.title(f'{column_name} Box Plot')
    plt.show()
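A minimal sketch of the thresholding step used inside `draw_confusion_matrix`: the model outputs probabilities, and comparing against the 0.5 threshold yields the binary labels passed to `confusion_matrix`. The probabilities below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

THRESHOLD = 0.5
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.62, 0.81, 0.35, 0.20, 0.90])  # hypothetical model output

y_pred = y_prob > THRESHOLD  # boolean array; scikit-learn treats it as 0/1
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)  # rows: true class, columns: predicted class
```

With these example probabilities, one actual failure (0.35) falls below the threshold and becomes a false negative; lowering the threshold would trade false negatives for false positives.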
Note:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- On executing the above line of code, you might see a warning regarding package dependencies. This message can be ignored, as the code above ensures that all necessary libraries and their dependencies are available to successfully execute the code in this notebook.
Loading the Data¶
train_data = pd.read_csv("Train.csv")
test_data = pd.read_csv("Test.csv")
#Make a copy of the train and test data
train_data_copy = train_data.copy()
test_data_copy = test_data.copy()
Data Overview¶
# Check that the percentage of the output variable is the same in both train and test datasets
train_percentage = train_data['Target'].value_counts(normalize=True)
test_percentage = test_data['Target'].value_counts(normalize=True)
print("Train Data Output Percentage:\n\n", train_percentage)
print("Test Data Output Percentage:\n", test_percentage)
Train Data Output Percentage: Target 0 0.9445 1 0.0555 Name: proportion, dtype: float64 Test Data Output Percentage: Target 0 0.9436 1 0.0564 Name: proportion, dtype: float64
- As a quick sanity check, the Target variable splits roughly 95% "no failure" to 5% "failure" in both the training and test sets, which is good: the test data is representative of the population ratio.
- The challenge with this dataset is that it is heavily imbalanced, meaning there is relatively little failure data in either set. We will use the class_weight option to give more weight to the minority class (1, indicating a wind turbine failure).
- The target is binary: 0 for no failure, 1 for failure. A sigmoid activation in the output layer is therefore a natural choice, since it outputs a probability suited to binary classification.
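A quick numeric sketch of why a sigmoid output suits this binary target: it squashes any real-valued activation into (0, 1), which we can read as P(failure) and compare against the 0.5 threshold. The logits here are arbitrary example values.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])  # example pre-activation values
probs = sigmoid(logits)

print(probs.round(3))             # all strictly between 0 and 1
print((probs > 0.5).astype(int))  # thresholded binary predictions
```

This is exactly the conversion performed later when `model.predict(x) > THRESHOLD` turns probabilities into failure/no-failure labels.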
train_data_copy.head(10) # Display the first 10 rows of the training data copy
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.464606 | -4.679129 | 3.101546 | 0.506130 | -0.221083 | -2.032511 | -2.910870 | 0.050714 | -1.522351 | 3.761892 | -5.714719 | 0.735893 | 0.981251 | 1.417884 | -3.375815 | -3.047303 | 0.306194 | 2.914097 | 2.269979 | 4.394876 | -2.388299 | 0.646388 | -1.190508 | 3.132986 | 0.665277 | -2.510846 | -0.036744 | 0.726218 | -3.982187 | -1.072638 | 1.667098 | 3.059700 | -1.690440 | 2.846296 | 2.235198 | 6.667486 | 0.443809 | -2.369169 | 2.950578 | -3.480324 | 0 |
| 1 | 3.365912 | 3.653381 | 0.909671 | -1.367528 | 0.332016 | 2.358938 | 0.732600 | -4.332135 | 0.565695 | -0.101080 | 1.914465 | -0.951458 | -1.255259 | -2.706522 | 0.193223 | -4.769379 | -2.205319 | 0.907716 | 0.756894 | -5.833678 | -3.065122 | 1.596647 | -1.757311 | 1.766444 | -0.267098 | 3.625036 | 1.500346 | -0.585712 | 0.783034 | -0.201217 | 0.024883 | -1.795474 | 3.032780 | -2.467514 | 1.894599 | -2.297780 | -1.731048 | 5.908837 | -0.386345 | 0.616242 | 0 |
| 2 | -3.831843 | -5.824444 | 0.634031 | -2.418815 | -1.773827 | 1.016824 | -2.098941 | -3.173204 | -2.081860 | 5.392621 | -0.770673 | 1.106718 | 1.144261 | 0.943301 | -3.163804 | -4.247825 | -4.038909 | 3.688534 | 3.311196 | 1.059002 | -2.143026 | 1.650120 | -1.660592 | 1.679910 | -0.450782 | -4.550695 | 3.738779 | 1.134404 | -2.033531 | 0.840839 | -1.600395 | -0.257101 | 0.803550 | 4.086219 | 2.292138 | 5.360850 | 0.351993 | 2.940021 | 3.839160 | -4.309402 | 0 |
| 3 | 1.618098 | 1.888342 | 7.046143 | -1.147285 | 0.083080 | -1.529780 | 0.207309 | -2.493629 | 0.344926 | 2.118578 | -3.053023 | 0.459719 | 2.704527 | -0.636086 | -0.453717 | -3.174046 | -3.404347 | -1.281536 | 1.582104 | -1.951778 | -3.516555 | -1.206011 | -5.627854 | -1.817653 | 2.124142 | 5.294642 | 4.748137 | -2.308536 | -3.962977 | -6.028730 | 4.948770 | -3.584425 | -2.577474 | 1.363769 | 0.622714 | 5.550100 | -1.526796 | 0.138853 | 3.101430 | -1.277378 | 0 |
| 4 | -0.111440 | 3.872488 | -3.758361 | -2.982897 | 3.792714 | 0.544960 | 0.205433 | 4.848994 | -1.854920 | -6.220023 | 1.998347 | 4.723757 | 0.709113 | -1.989432 | -2.632684 | 4.184447 | 2.245356 | 3.734452 | -6.312766 | -5.379918 | -0.886667 | 2.061694 | 9.445586 | 4.489976 | -3.945144 | 4.582065 | -8.780422 | -3.382967 | 5.106507 | 6.787513 | 2.044184 | 8.265896 | 6.629213 | -10.068689 | 1.222987 | -3.229763 | 1.686909 | -2.163896 | -3.644622 | 6.510338 | 0 |
| 5 | 0.159623 | -4.233781 | -0.264310 | -5.477119 | -0.190854 | -0.356274 | -0.134486 | 4.066608 | -3.858569 | 1.692441 | 0.137901 | 3.974719 | 0.672853 | 1.878144 | 0.764158 | 4.235913 | -2.129272 | 2.348465 | -2.147454 | -0.982376 | 0.386345 | 1.010637 | 3.418654 | 0.996017 | 0.060580 | -3.036740 | 1.787573 | -1.726537 | 0.307837 | 1.902350 | 4.665858 | 3.227235 | 0.628900 | -1.548860 | 1.321979 | 5.461345 | 1.109410 | -3.869993 | 0.273964 | 2.805941 | 0 |
| 6 | -0.184565 | -4.721470 | 0.864988 | -3.078695 | -2.226888 | -1.282220 | -0.804717 | 3.289733 | -1.567971 | 0.749904 | 0.528830 | 3.220564 | 2.945183 | 1.724073 | -0.923123 | 2.534830 | -1.696713 | 0.677068 | -0.246087 | 2.747678 | -1.165392 | 0.247621 | 1.160684 | -2.850139 | 0.503405 | -3.532215 | 1.861243 | -1.465354 | 0.873767 | 2.418470 | 0.939376 | -0.544941 | -0.762921 | 0.815558 | 1.889373 | 3.624347 | 1.555740 | -5.432884 | 0.678703 | 0.464697 | 0 |
| 7 | 1.734840 | 1.682945 | -1.269070 | 4.600630 | -1.416975 | -2.543916 | 0.131648 | -0.198661 | 3.094057 | -1.109324 | -1.662364 | 0.943806 | 3.481045 | 0.137055 | -3.472977 | -4.075917 | 1.726571 | -1.908618 | 3.569249 | 2.512191 | -4.578679 | 3.062674 | 3.686149 | 0.610743 | -0.429539 | 0.880126 | -0.993851 | 1.134221 | -3.767917 | -0.692236 | -5.244396 | 1.717474 | -3.838931 | 1.569448 | 1.794899 | -4.268517 | -0.516195 | -0.619218 | -0.830889 | -4.967266 | 1 |
| 8 | 1.781583 | 1.314664 | 4.248690 | -0.518293 | -0.149044 | 0.033082 | -1.087893 | -3.117561 | 0.624935 | 1.567455 | -0.415122 | -1.400792 | 2.607063 | -1.023519 | -2.877902 | -4.524080 | -4.353952 | 0.106859 | 1.298601 | -3.595654 | -5.409204 | 0.633421 | -3.043436 | 0.965268 | -0.266332 | 4.670862 | 1.846717 | -2.320822 | -1.317705 | -0.681722 | 3.280787 | 1.611014 | 2.951390 | -1.862016 | 4.389598 | 1.371300 | -2.516235 | 0.770496 | 0.831132 | -2.310953 | 0 |
| 9 | -0.894140 | 4.011498 | 5.251902 | 3.320747 | 0.727067 | -4.771070 | 1.031232 | 3.632080 | -1.391444 | -1.966746 | -4.779273 | 6.616781 | -0.147815 | -2.513234 | 0.734111 | 0.474710 | 5.085254 | -2.360998 | 4.561398 | 2.287065 | -2.307024 | -0.948690 | -0.300906 | 2.546197 | 0.738320 | 4.266330 | -4.144926 | -0.012559 | -1.469495 | -2.003484 | 1.680064 | -0.635742 | -4.449139 | 2.296340 | 1.575110 | 1.376268 | 0.596757 | -1.413652 | 0.543871 | 0.035020 | 0 |
test_data_copy.head(10)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.613489 | -3.819640 | 2.202302 | 1.300420 | -1.184929 | -4.495964 | -1.835817 | 4.722989 | 1.206140 | -0.341909 | -5.122874 | 1.017021 | 4.818549 | 3.269001 | -2.984330 | 1.387370 | 2.032002 | -0.511587 | -1.023069 | 7.338733 | -2.242244 | 0.155489 | 2.053786 | -2.772273 | 1.851369 | -1.788696 | -0.277282 | -1.255143 | -3.832886 | -1.504542 | 1.586765 | 2.291204 | -5.411388 | 0.870073 | 0.574479 | 4.157191 | 1.428093 | -10.511342 | 0.454664 | -1.448363 | 0 |
| 1 | 0.389608 | -0.512341 | 0.527053 | -2.576776 | -1.016766 | 2.235112 | -0.441301 | -4.405744 | -0.332869 | 1.966794 | 1.796544 | 0.410490 | 0.638328 | -1.389600 | -1.883410 | -5.017922 | -3.827238 | 2.418060 | 1.762285 | -3.242297 | -3.192960 | 1.857454 | -1.707954 | 0.633444 | -0.587898 | 0.083683 | 3.013935 | -0.182309 | 0.223917 | 0.865228 | -1.782158 | -2.474936 | 2.493582 | 0.315165 | 2.059288 | 0.683859 | -0.485452 | 5.128350 | 1.720744 | -1.488235 | 0 |
| 2 | -0.874861 | -0.640632 | 4.084202 | -1.590454 | 0.525855 | -1.957592 | -0.695367 | 1.347309 | -1.732348 | 0.466500 | -4.928214 | 3.565070 | -0.449329 | -0.656246 | -0.166537 | -1.630207 | 2.291865 | 2.396492 | 0.601278 | 1.793534 | -2.120238 | 0.481968 | -0.840707 | 1.790197 | 1.874395 | 0.363930 | -0.169063 | -0.483832 | -2.118982 | -2.156586 | 2.907291 | -1.318888 | -2.997464 | 0.459664 | 0.619774 | 5.631504 | 1.323512 | -1.752154 | 1.808302 | 1.675748 | 0 |
| 3 | 0.238384 | 1.458607 | 4.014528 | 2.534478 | 1.196987 | -3.117330 | -0.924035 | 0.269493 | 1.322436 | 0.702345 | -5.578345 | -0.850662 | 2.590525 | 0.767418 | -2.390809 | -2.341961 | 0.571875 | -0.933751 | 0.508677 | 1.210715 | -3.259524 | 0.104587 | -0.658875 | 1.498107 | 1.100305 | 4.142988 | -0.248446 | -1.136516 | -5.355810 | -4.545931 | 3.808667 | 3.517918 | -3.074085 | -0.284220 | 0.954576 | 3.029331 | -1.367198 | -3.412140 | 0.906000 | -2.450889 | 0 |
| 4 | 5.828225 | 2.768260 | -1.234530 | 2.809264 | -1.641648 | -1.406698 | 0.568643 | 0.965043 | 1.918379 | -2.774855 | -0.530016 | 1.374544 | -0.650941 | -1.679466 | -0.379220 | -4.443143 | 3.893857 | -0.607640 | 2.944931 | 0.367233 | -5.789081 | 4.597528 | 4.450264 | 3.224941 | 0.396701 | 0.247765 | -2.362047 | 1.079378 | -0.473076 | 2.242810 | -3.591421 | 1.773841 | -1.501573 | -2.226702 | 4.776830 | -6.559698 | -0.805551 | -0.276007 | -3.858207 | -0.537694 | 0 |
| 5 | -1.885713 | -1.964160 | 0.245667 | -1.187255 | 0.027369 | -2.214094 | -0.605558 | 3.434368 | -2.366542 | 0.238592 | -2.421572 | 5.443762 | 1.621775 | 0.403306 | -2.084618 | 0.838689 | 1.168480 | 2.018892 | 0.968280 | 1.563233 | -2.037208 | 1.807605 | 4.216895 | 2.806480 | -0.692462 | -1.224669 | -1.904662 | -0.416139 | -1.203544 | 1.657175 | 0.658367 | 3.481445 | -1.241937 | 0.165481 | 1.938014 | 3.174898 | 1.513700 | -2.634459 | 0.694483 | -0.169500 | 0 |
| 6 | -1.836429 | 1.216661 | -0.186460 | 0.232731 | 1.752135 | -1.982141 | 0.637039 | 3.654029 | -2.891643 | -0.882726 | -2.881859 | 5.532344 | -1.843551 | -0.994694 | 0.602109 | 1.870065 | 3.930774 | 1.278002 | 1.110149 | 0.088251 | 0.226533 | 1.064542 | 4.210596 | 5.268233 | -0.754587 | 0.433090 | -4.173879 | 0.675818 | -0.654066 | 0.612422 | 1.253968 | 3.697508 | -1.371313 | -0.267922 | 0.385374 | 1.392039 | 1.195155 | 0.104975 | -0.258228 | 1.581771 | 0 |
| 7 | -1.649117 | 0.646787 | 2.657947 | 1.395099 | 0.725959 | 0.305211 | -1.877257 | -3.814487 | 2.273639 | 0.434063 | -2.533155 | -3.581302 | 1.480436 | -0.453753 | -3.334392 | -4.882117 | -0.657244 | 1.311949 | -0.487993 | 0.744801 | -2.215417 | -0.555530 | -3.758861 | -0.623616 | 0.436324 | 2.663326 | 0.024642 | -0.514574 | -1.836844 | -2.277864 | -0.105820 | -1.082314 | 0.530939 | -0.290736 | -0.219059 | 1.364707 | -0.565783 | 0.605945 | 1.772588 | -1.977966 | 0 |
| 8 | -2.744431 | -5.870927 | 1.169155 | -1.586454 | -2.215360 | -3.561773 | -2.037385 | 2.782849 | -0.687223 | 1.527678 | -4.668574 | 5.722978 | 5.746367 | 2.352832 | -5.336724 | -2.039904 | 0.340710 | 3.045392 | 1.603427 | 6.519083 | -4.491886 | 2.848353 | 3.717535 | -1.080943 | 0.840454 | -4.229435 | 1.391155 | -0.403130 | -4.380458 | -0.042646 | -2.192626 | 0.172074 | -5.489960 | 3.224386 | 1.433453 | 6.421956 | 3.016854 | -5.953351 | 3.084918 | -2.982987 | 0 |
| 9 | -0.247320 | -1.130009 | 4.584899 | 0.051528 | 0.044828 | -2.527062 | -1.643095 | 1.042020 | -0.059002 | 0.751700 | -4.915543 | 0.709726 | 1.810841 | 0.466331 | -2.012662 | -2.139615 | 0.767581 | 1.047640 | 0.299762 | 2.506561 | -3.523522 | 0.229476 | -1.363760 | 0.667588 | 1.640188 | 1.301918 | 0.040996 | -1.302753 | -2.962745 | -2.077191 | 3.567099 | 1.133185 | -2.171650 | -0.245509 | 2.071918 | 4.719610 | 0.033392 | -4.396785 | 1.221412 | -0.531737 | 0 |
#Check the initial shape of the train and test data
print("Train Data Shape:", train_data_copy.shape)
print("Test Data Shape:", test_data_copy.shape)
Train Data Shape: (20000, 41) Test Data Shape: (5000, 41)
#Print the info of the train and test data
print("Train Data Info:")
train_data_copy.info()
print("\nTest Data Info:")
test_data_copy.info()
Train Data Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 20000 entries, 0 to 19999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 19982 non-null float64 1 V2 19982 non-null float64 2 V3 20000 non-null float64 3 V4 20000 non-null float64 4 V5 20000 non-null float64 5 V6 20000 non-null float64 6 V7 20000 non-null float64 7 V8 20000 non-null float64 8 V9 20000 non-null float64 9 V10 20000 non-null float64 10 V11 20000 non-null float64 11 V12 20000 non-null float64 12 V13 20000 non-null float64 13 V14 20000 non-null float64 14 V15 20000 non-null float64 15 V16 20000 non-null float64 16 V17 20000 non-null float64 17 V18 20000 non-null float64 18 V19 20000 non-null float64 19 V20 20000 non-null float64 20 V21 20000 non-null float64 21 V22 20000 non-null float64 22 V23 20000 non-null float64 23 V24 20000 non-null float64 24 V25 20000 non-null float64 25 V26 20000 non-null float64 26 V27 20000 non-null float64 27 V28 20000 non-null float64 28 V29 20000 non-null float64 29 V30 20000 non-null float64 30 V31 20000 non-null float64 31 V32 20000 non-null float64 32 V33 20000 non-null float64 33 V34 20000 non-null float64 34 V35 20000 non-null float64 35 V36 20000 non-null float64 36 V37 20000 non-null float64 37 V38 20000 non-null float64 38 V39 20000 non-null float64 39 V40 20000 non-null float64 40 Target 20000 non-null int64 dtypes: float64(40), int64(1) memory usage: 6.3 MB Test Data Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 4995 non-null float64 1 V2 4994 non-null float64 2 V3 5000 non-null float64 3 V4 5000 non-null float64 4 V5 5000 non-null float64 5 V6 5000 non-null float64 6 V7 5000 non-null float64 7 V8 5000 non-null float64 8 V9 5000 non-null float64 9 V10 5000 non-null float64 10 V11 5000 non-null float64 11 V12 5000 non-null float64 12 V13 5000 non-null float64 13 
V14 5000 non-null float64 14 V15 5000 non-null float64 15 V16 5000 non-null float64 16 V17 5000 non-null float64 17 V18 5000 non-null float64 18 V19 5000 non-null float64 19 V20 5000 non-null float64 20 V21 5000 non-null float64 21 V22 5000 non-null float64 22 V23 5000 non-null float64 23 V24 5000 non-null float64 24 V25 5000 non-null float64 25 V26 5000 non-null float64 26 V27 5000 non-null float64 27 V28 5000 non-null float64 28 V29 5000 non-null float64 29 V30 5000 non-null float64 30 V31 5000 non-null float64 31 V32 5000 non-null float64 32 V33 5000 non-null float64 33 V34 5000 non-null float64 34 V35 5000 non-null float64 35 V36 5000 non-null float64 36 V37 5000 non-null float64 37 V38 5000 non-null float64 38 V39 5000 non-null float64 39 V40 5000 non-null float64 40 Target 5000 non-null int64 dtypes: float64(40), int64(1) memory usage: 1.6 MB
- The 40 predictor variables V1-V40 are all floating-point values
- The target variable Target is an integer
#Check for missing values in the train and test data
print("\nMissing Values in Train Data:\n", train_data_copy.isnull().sum())
print("\nMissing Values in Test Data:\n", test_data_copy.isnull().sum())
#Print the total number of missing values in the train and test data
print("\nTotal Missing Values in Train Data:", train_data_copy.isnull().sum().sum())
print("Total Missing Values in Test Data:", test_data_copy.isnull().sum().sum())
Missing Values in Train Data:
V1        18
V2        18
V3-V40     0
Target     0
dtype: int64

Missing Values in Test Data:
V1         5
V2         6
V3-V40     0
Target     0
dtype: int64

Total Missing Values in Train Data: 36
Total Missing Values in Test Data: 11
- There are a total of 36 missing values in the training data set, all in columns V1 and V2.
- There are a total of 11 missing values in the test data set, also confined to V1 and V2.
- To avoid data leakage, we split the training set into training and validation sets, fit any data treatment (such as imputation) on the training set only, and then apply it to the validation and test sets.
- After splitting the training data into training and validation sets, we confirm that the proportion of class 0 to class 1 remains roughly 95% to 5% in each.
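The leakage-safe treatment described in the bullets above can be sketched end to end. This is a minimal illustration on toy data, not the notebook's actual split (which uses the `VS` and `RS` constants defined earlier): the imputer is fitted on the training split only, and the learned medians are then applied to the validation split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data standing in for the real training set (illustrative only)
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["V1", "V2", "V3"])
df.loc[::10, "V1"] = np.nan  # inject some missing values
target = pd.Series(rng.integers(0, 2, size=100), name="Target")

# Stratified split so both sets keep the same class proportions
X_tr, X_va, y_tr, y_va = train_test_split(
    df, target, test_size=0.25, random_state=1, stratify=target
)

# Fit the imputer on the training split only, then transform the
# validation split using the statistics learned from training.
imp = SimpleImputer(strategy="median")
X_tr_imp = pd.DataFrame(imp.fit_transform(X_tr), columns=X_tr.columns)
X_va_imp = pd.DataFrame(imp.transform(X_va), columns=X_va.columns)
```

The key point is the asymmetry: `fit_transform` on the training split, plain `transform` everywhere else, so no statistic from validation or test data ever influences the treatment.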
#Check for duplicate rows in the training data
duplicates_train = train_data_copy.duplicated().sum()
print("\nDuplicate Rows in Train Data:", duplicates_train)
Duplicate Rows in Train Data: 0
# Look at the description of the training data
train_data_copy.describe()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19982.000000 | 19982.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 |
| mean | -0.271996 | 0.440430 | 2.484699 | -0.083152 | -0.053752 | -0.995443 | -0.879325 | -0.548195 | -0.016808 | -0.012998 | -1.895393 | 1.604825 | 1.580486 | -0.950632 | -2.414993 | -2.925225 | -0.134261 | 1.189347 | 1.181808 | 0.023608 | -3.611252 | 0.951835 | -0.366116 | 1.134389 | -0.002186 | 1.873785 | -0.612413 | -0.883218 | -0.985625 | -0.015534 | 0.486842 | 0.303799 | 0.049825 | -0.462702 | 2.229620 | 1.514809 | 0.011316 | -0.344025 | 0.890653 | -0.875630 | 0.055500 |
| std | 3.441625 | 3.150784 | 3.388963 | 3.431595 | 2.104801 | 2.040970 | 1.761626 | 3.295756 | 2.160568 | 2.193201 | 3.124322 | 2.930454 | 2.874658 | 1.789651 | 3.354974 | 4.221717 | 3.345462 | 2.592276 | 3.396925 | 3.669477 | 3.567690 | 1.651547 | 4.031860 | 3.912069 | 2.016740 | 3.435137 | 4.368847 | 1.917713 | 2.684365 | 3.005258 | 3.461384 | 5.500400 | 3.575285 | 3.183841 | 2.937102 | 3.800860 | 1.788165 | 3.948147 | 1.753054 | 3.012155 | 0.228959 |
| min | -11.876451 | -12.319951 | -10.708139 | -15.082052 | -8.603361 | -10.227147 | -7.949681 | -15.657561 | -8.596313 | -9.853957 | -14.832058 | -12.948007 | -13.228247 | -7.738593 | -16.416606 | -20.374158 | -14.091184 | -11.643994 | -13.491784 | -13.922659 | -17.956231 | -10.122095 | -14.866128 | -16.387147 | -8.228266 | -11.834271 | -14.904939 | -9.269489 | -12.579469 | -14.796047 | -13.722760 | -19.876502 | -16.898353 | -17.985094 | -15.349803 | -14.833178 | -5.478350 | -17.375002 | -6.438880 | -11.023935 | 0.000000 |
| 25% | -2.737146 | -1.640674 | 0.206860 | -2.347660 | -1.535607 | -2.347238 | -2.030926 | -2.642665 | -1.494973 | -1.411212 | -3.922404 | -0.396514 | -0.223545 | -2.170741 | -4.415322 | -5.634240 | -2.215611 | -0.403917 | -1.050168 | -2.432953 | -5.930360 | -0.118127 | -3.098756 | -1.468062 | -1.365178 | -0.337863 | -3.652323 | -2.171218 | -2.787443 | -1.867114 | -1.817772 | -3.420469 | -2.242857 | -2.136984 | 0.336191 | -0.943809 | -1.255819 | -2.987638 | -0.272250 | -2.940193 | 0.000000 |
| 50% | -0.747917 | 0.471536 | 2.255786 | -0.135241 | -0.101952 | -1.000515 | -0.917179 | -0.389085 | -0.067597 | 0.100973 | -1.921237 | 1.507841 | 1.637185 | -0.957163 | -2.382617 | -2.682705 | -0.014580 | 0.883398 | 1.279061 | 0.033415 | -3.532888 | 0.974687 | -0.262093 | 0.969048 | 0.025050 | 1.950531 | -0.884894 | -0.891073 | -1.176181 | 0.184346 | 0.490304 | 0.052073 | -0.066249 | -0.255008 | 2.098633 | 1.566526 | -0.128435 | -0.316849 | 0.919261 | -0.920806 | 0.000000 |
| 75% | 1.840112 | 2.543967 | 4.566165 | 2.130615 | 1.340480 | 0.380330 | 0.223695 | 1.722965 | 1.409203 | 1.477045 | 0.118906 | 3.571454 | 3.459886 | 0.270677 | -0.359052 | -0.095046 | 2.068751 | 2.571770 | 3.493299 | 2.512372 | -1.265884 | 2.025594 | 2.451750 | 3.545975 | 1.397112 | 4.130037 | 2.189177 | 0.375884 | 0.629773 | 2.036229 | 2.730688 | 3.761722 | 2.255134 | 1.436935 | 4.064358 | 3.983939 | 1.175533 | 2.279399 | 2.057540 | 1.119897 | 0.000000 |
| max | 15.493002 | 13.089269 | 17.090919 | 13.236381 | 8.133797 | 6.975847 | 8.006091 | 11.679495 | 8.137580 | 8.108472 | 11.826433 | 15.080698 | 15.419616 | 5.670664 | 12.246455 | 13.583212 | 16.756432 | 13.179863 | 13.237742 | 16.052339 | 13.840473 | 7.409856 | 14.458734 | 17.163291 | 8.223389 | 16.836410 | 17.560404 | 6.527643 | 10.722055 | 12.505812 | 17.255090 | 23.633187 | 16.692486 | 14.358213 | 15.291065 | 19.329576 | 7.467006 | 15.289923 | 7.759877 | 10.654265 | 1.000000 |
- Since the input data are ciphered and we have no information about what the parameters measure, we cannot read much into the summary statistics beyond the fact that V1 and V2 contain missing values that need to be imputed. For example, there is not enough information to judge whether negative values are plausible for any of the 40 measurements.
Exploratory Data Analysis¶
Univariate analysis¶
#Get the list of all columns in train_data_copy
all_columns = train_data_copy.columns.tolist()
#Since there are 40 input variables and 1 output variable, loop over all columns and draw a box plot and histogram for each
for column in all_columns:
    draw_boxplot_and_histogram(train_data_copy, column)
plt.show()
- The input variables V1-V40 are all roughly normally distributed.
- Based on the box plots, every input variable V1-V40 contains outliers.
- The Target variable is binary (0/1) and heavily imbalanced, so it naturally does not follow a normal distribution.
Bivariate Analysis¶
Correlation Matrix¶
draw_default_correlation_matrix(train_data_copy)
#Identify all pairs of input variables that are strongly correlated with each other
# Note: THRESHOLD, the cutoff for a strong absolute correlation, is set to 0.5
strong_correlations = train_data_copy.corr().abs() > THRESHOLD
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair is reported once
strong_correlations = strong_correlations.where(np.triu(np.ones(strong_correlations.shape), k=1).astype(bool))
strong_correlations = strong_correlations.stack().reset_index()
strong_correlations.columns = ['Variable1', 'Variable2', 'Correlation']
strong_correlations = strong_correlations[strong_correlations['Correlation'] == True]
print("Strong Correlations:\n", strong_correlations)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V1 | 1.000000 | 0.313593 | 0.388855 | -0.294832 | -0.516085 | 0.175518 | 0.480694 | -0.361016 | 0.398778 | -0.022043 | 0.291310 | -0.144904 | 0.056089 | -0.273683 | 0.413259 | -0.334787 | -0.347788 | -0.390056 | 0.127776 | -0.341331 | -0.392011 | 0.207272 | -0.436782 | -0.512832 | 0.675602 | 0.222721 | 0.683816 | -0.280558 | -0.062311 | -0.464534 | 0.084862 | -0.633279 | -0.289917 | -0.019433 | 0.142695 | -0.124949 | -0.350610 | 0.148316 | -0.120898 | 0.212632 | 0.073307 |
| V2 | 0.313593 | 1.000000 | 0.095185 | 0.290202 | 0.383785 | 0.233949 | 0.455632 | -0.383237 | 0.280601 | -0.483879 | 0.158944 | -0.159064 | -0.381918 | -0.853530 | 0.221681 | -0.241576 | 0.164679 | -0.303630 | 0.119098 | -0.589420 | -0.064819 | -0.096020 | -0.181389 | 0.221934 | -0.127100 | 0.787440 | -0.204437 | 0.032709 | 0.179821 | -0.216071 | -0.042449 | -0.115820 | 0.203387 | -0.281097 | -0.054777 | -0.580675 | -0.437349 | 0.655368 | -0.350539 | 0.155617 | -0.000946 |
| V3 | 0.388855 | 0.095185 | 1.000000 | -0.028828 | -0.359628 | -0.291644 | -0.156267 | -0.412009 | 0.233626 | 0.446161 | -0.334642 | -0.166270 | 0.329552 | -0.222967 | -0.061598 | -0.533497 | -0.413890 | -0.267845 | 0.402367 | -0.039521 | -0.658327 | -0.194010 | -0.785580 | -0.265330 | 0.595676 | 0.459761 | 0.499957 | -0.411772 | -0.445440 | -0.607322 | 0.463729 | -0.367431 | -0.219509 | 0.225753 | 0.501409 | 0.438341 | -0.502482 | -0.073865 | 0.527742 | -0.306190 | -0.213855 |
| V4 | -0.294832 | 0.290202 | -0.028828 | 1.000000 | 0.084185 | -0.470199 | -0.196909 | 0.034906 | 0.265154 | -0.107058 | -0.363612 | -0.235211 | -0.272949 | -0.221931 | -0.150716 | -0.194471 | 0.606701 | -0.413616 | 0.596391 | 0.412007 | -0.085815 | -0.033303 | 0.036908 | 0.516096 | -0.269900 | 0.106663 | -0.588490 | 0.660283 | -0.186060 | 0.121724 | -0.368177 | 0.383456 | -0.052216 | 0.297496 | 0.340764 | -0.557958 | -0.356650 | 0.090986 | -0.389080 | -0.665310 | 0.110786 |
| V5 | -0.516085 | 0.383785 | -0.359628 | 0.084185 | 1.000000 | 0.156161 | -0.078436 | 0.168267 | -0.297635 | -0.343741 | -0.212215 | -0.018023 | -0.333497 | -0.146212 | -0.146589 | 0.266994 | 0.328192 | 0.432620 | -0.504478 | -0.360510 | 0.383959 | -0.089915 | 0.456634 | 0.662638 | -0.602529 | 0.405462 | -0.662801 | -0.034228 | 0.093092 | 0.141055 | 0.301930 | 0.619779 | 0.458888 | -0.607112 | -0.341275 | -0.045510 | 0.064515 | 0.171836 | -0.217778 | 0.335332 | -0.100525 |
| V6 | 0.175518 | 0.233949 | -0.291644 | -0.470199 | 0.156161 | 1.000000 | 0.210914 | -0.559084 | 0.084554 | -0.116887 | 0.710480 | -0.395911 | -0.229053 | -0.346696 | 0.145335 | -0.084184 | -0.454301 | 0.286163 | -0.418803 | -0.695236 | 0.223402 | -0.068138 | -0.186560 | -0.198847 | -0.190472 | 0.147217 | 0.217310 | -0.182653 | 0.586885 | 0.153105 | -0.115764 | -0.292468 | 0.587371 | -0.401306 | -0.317068 | -0.247402 | -0.067280 | 0.628722 | -0.025458 | 0.423882 | 0.000237 |
| V7 | 0.480694 | 0.455632 | -0.156267 | -0.196909 | -0.078436 | 0.210914 | 1.000000 | 0.092713 | -0.244421 | -0.176849 | 0.530656 | 0.397331 | -0.493828 | -0.323227 | 0.867871 | 0.401290 | 0.027866 | -0.561630 | 0.135626 | -0.413238 | 0.470703 | -0.277780 | -0.050080 | -0.210543 | 0.201945 | 0.023205 | 0.289410 | 0.189329 | 0.311074 | -0.250387 | -0.034001 | -0.458844 | -0.267082 | 0.232444 | -0.438937 | -0.280033 | -0.007940 | 0.469057 | -0.236700 | 0.479354 | 0.236907 |
| V8 | -0.361016 | -0.383237 | -0.412009 | 0.034906 | 0.168267 | -0.559084 | 0.092713 | 1.000000 | -0.611735 | -0.179856 | -0.193942 | 0.674030 | -0.104306 | 0.545237 | 0.176102 | 0.802505 | 0.514604 | -0.025594 | -0.151573 | 0.440875 | 0.484455 | -0.049784 | 0.717858 | 0.250453 | -0.202921 | -0.502977 | -0.418680 | 0.172116 | 0.058335 | 0.360815 | 0.149730 | 0.471798 | -0.251002 | 0.054575 | -0.167235 | 0.155913 | 0.522797 | -0.614964 | -0.344376 | 0.256984 | 0.135996 |
| V9 | 0.398778 | 0.280601 | 0.233626 | 0.265154 | -0.297635 | 0.084554 | -0.244421 | -0.611735 | 1.000000 | -0.293048 | -0.090165 | -0.629641 | 0.391041 | -0.238225 | -0.393782 | -0.752931 | -0.040711 | -0.045757 | 0.041111 | 0.126822 | -0.596978 | 0.318376 | -0.344270 | -0.390001 | 0.314500 | 0.316018 | 0.173953 | -0.099764 | -0.235584 | -0.282570 | -0.458116 | -0.369241 | -0.137518 | -0.102147 | 0.117301 | -0.351657 | -0.200198 | 0.038099 | 0.001001 | -0.308607 | 0.008124 |
| V10 | -0.022043 | -0.483879 | 0.446161 | -0.107058 | -0.343741 | -0.116887 | -0.176849 | -0.179856 | -0.293048 | 1.000000 | -0.156346 | -0.119735 | 0.068491 | 0.343507 | 0.080268 | -0.117692 | -0.509791 | -0.144308 | 0.414314 | 0.056561 | -0.125682 | -0.174525 | -0.458001 | -0.001608 | 0.276199 | -0.223186 | 0.502179 | 0.033785 | -0.434624 | -0.314250 | 0.403991 | -0.018839 | -0.110081 | 0.514227 | 0.346129 | 0.561161 | -0.403067 | -0.007520 | 0.560471 | -0.474803 | -0.051263 |
| V11 | 0.291310 | 0.158944 | -0.334642 | -0.363612 | -0.212215 | 0.710480 | 0.530656 | -0.193942 | -0.090165 | -0.156346 | 1.000000 | -0.004893 | -0.178650 | -0.275305 | 0.411599 | 0.276387 | -0.437216 | -0.244070 | -0.093482 | -0.535816 | 0.336133 | -0.261462 | -0.111463 | -0.377024 | -0.217322 | -0.057829 | 0.252830 | -0.087538 | 0.811228 | 0.371119 | -0.231534 | -0.363715 | 0.422481 | -0.062297 | -0.169441 | -0.427396 | -0.045318 | 0.471418 | -0.239641 | 0.336767 | 0.196715 |
| V12 | -0.144904 | -0.159064 | -0.166270 | -0.235211 | -0.018023 | -0.395911 | 0.397331 | 0.674030 | -0.629641 | -0.119735 | -0.004893 | 1.000000 | -0.010808 | 0.130274 | 0.264978 | 0.562087 | 0.243546 | -0.047265 | 0.064590 | 0.147514 | 0.281808 | 0.016347 | 0.454646 | 0.036910 | -0.052367 | -0.297273 | -0.083374 | 0.043959 | 0.080279 | 0.137589 | 0.033878 | 0.008758 | -0.355644 | 0.237594 | -0.194636 | 0.242611 | 0.548449 | -0.206345 | 0.053823 | 0.307470 | -0.021807 |
| V13 | 0.056089 | -0.381918 | 0.329552 | -0.272949 | -0.333497 | -0.229053 | -0.493828 | -0.104306 | 0.391041 | 0.068491 | -0.178650 | -0.010808 | 1.000000 | 0.367444 | -0.684788 | -0.314649 | -0.458437 | 0.255678 | -0.198617 | 0.226223 | -0.607169 | 0.277293 | -0.055490 | -0.492305 | 0.160221 | 0.084519 | 0.301270 | -0.657255 | -0.339866 | -0.079904 | 0.024917 | -0.102412 | -0.111937 | -0.112675 | 0.224130 | 0.422558 | 0.168500 | -0.559274 | 0.486516 | -0.258869 | -0.139718 |
| V14 | -0.273683 | -0.853530 | -0.222967 | -0.221931 | -0.146212 | -0.346696 | -0.323227 | 0.545237 | -0.238225 | 0.343507 | -0.275305 | 0.130274 | 0.367444 | 1.000000 | -0.157487 | 0.404744 | -0.030553 | 0.220037 | -0.302476 | 0.550226 | 0.208446 | 0.092585 | 0.383793 | -0.149375 | 0.115315 | -0.674520 | 0.118412 | -0.054306 | -0.353208 | 0.044626 | 0.172368 | 0.276829 | -0.321509 | 0.142082 | -0.161742 | 0.547594 | 0.422413 | -0.762684 | 0.184663 | -0.053814 | 0.117586 |
| V15 | 0.413259 | 0.221681 | -0.061598 | -0.150716 | -0.146589 | 0.145335 | 0.867871 | 0.176102 | -0.393782 | 0.080268 | 0.411599 | 0.264978 | -0.684788 | -0.157487 | 1.000000 | 0.470695 | 0.073532 | -0.590202 | 0.213037 | -0.245100 | 0.567302 | -0.439281 | -0.179441 | -0.138497 | 0.333670 | -0.191763 | 0.314699 | 0.323758 | 0.280440 | -0.252483 | 0.138395 | -0.394717 | -0.287010 | 0.364587 | -0.337384 | -0.133556 | -0.084773 | 0.347191 | -0.238364 | 0.438196 | 0.249118 |
| V16 | -0.334787 | -0.241576 | -0.533497 | -0.194471 | 0.266994 | -0.084184 | 0.401290 | 0.802505 | -0.752931 | -0.117692 | 0.276387 | 0.562087 | -0.314649 | 0.404744 | 0.470695 | 1.000000 | 0.216777 | -0.130780 | -0.279313 | 0.036503 | 0.836527 | -0.437891 | 0.532444 | 0.102879 | -0.321164 | -0.426965 | -0.252718 | 0.122186 | 0.416053 | 0.359406 | 0.210411 | 0.289538 | 0.014966 | 0.040945 | -0.447736 | 0.080209 | 0.430631 | -0.235650 | -0.287419 | 0.477094 | 0.230507 |
| V17 | -0.347788 | 0.164679 | -0.413890 | 0.606701 | 0.328192 | -0.454301 | 0.027866 | 0.514604 | -0.040711 | -0.509791 | -0.437216 | 0.243546 | -0.458437 | -0.030553 | 0.073532 | 0.216777 | 1.000000 | -0.019526 | 0.092045 | 0.511172 | 0.303739 | 0.142651 | 0.535777 | 0.492579 | -0.169761 | -0.216592 | -0.706598 | 0.659308 | -0.059505 | 0.171504 | -0.363612 | 0.344394 | -0.307117 | 0.070968 | -0.213708 | -0.374603 | 0.348304 | -0.131721 | -0.511091 | 0.038264 | 0.085314 |
| V18 | -0.390056 | -0.303630 | -0.267845 | -0.413616 | 0.432620 | 0.286163 | -0.561630 | -0.025594 | -0.045757 | -0.144308 | -0.244070 | -0.047265 | 0.255678 | 0.220037 | -0.590202 | -0.130780 | -0.019526 | 1.000000 | -0.693428 | -0.034312 | -0.082599 | 0.466859 | 0.396764 | 0.169876 | -0.251902 | -0.064678 | -0.215565 | -0.325600 | -0.039814 | 0.235128 | 0.048011 | 0.287090 | 0.314099 | -0.602106 | -0.190625 | 0.356522 | 0.479581 | -0.189250 | 0.225480 | 0.286388 | -0.293340 |
| V19 | 0.127776 | 0.119098 | 0.402367 | 0.596391 | -0.504478 | -0.418803 | 0.135626 | -0.151573 | 0.041111 | 0.414314 | -0.093482 | 0.064590 | -0.198617 | -0.302476 | 0.213037 | -0.279313 | 0.092045 | -0.693428 | 1.000000 | 0.246063 | -0.267267 | -0.111111 | -0.428241 | 0.136032 | 0.173524 | -0.024596 | 0.095771 | 0.529055 | -0.213289 | -0.138990 | -0.219658 | -0.147915 | -0.286116 | 0.756188 | 0.553275 | -0.240156 | -0.505795 | 0.266142 | 0.031867 | -0.699379 | 0.053897 |
| V20 | -0.341331 | -0.589420 | -0.039521 | 0.412007 | -0.360510 | -0.695236 | -0.413238 | 0.440875 | 0.126822 | 0.056561 | -0.535816 | 0.147514 | 0.226223 | 0.550226 | -0.245100 | 0.036503 | 0.511172 | -0.034312 | 0.246063 | 1.000000 | -0.047137 | 0.116157 | 0.201459 | -0.069796 | 0.208989 | -0.623512 | -0.179680 | 0.413681 | -0.376122 | 0.065311 | -0.367301 | 0.096048 | -0.580161 | 0.503773 | 0.059147 | 0.143676 | 0.426708 | -0.646999 | 0.079338 | -0.412649 | 0.070803 |
| V21 | -0.392011 | -0.064819 | -0.658327 | -0.085815 | 0.383959 | 0.223402 | 0.470703 | 0.484455 | -0.596978 | -0.125682 | 0.336133 | 0.281808 | -0.607169 | 0.208446 | 0.567302 | 0.836527 | 0.303739 | -0.082599 | -0.267267 | -0.047137 | 1.000000 | -0.507400 | 0.381939 | 0.143596 | -0.321902 | -0.410914 | -0.258423 | 0.406565 | 0.453329 | 0.229215 | -0.021029 | 0.142997 | 0.054475 | 0.111245 | -0.700400 | -0.102281 | 0.386868 | 0.155868 | -0.242004 | 0.470061 | 0.256411 |
| V22 | 0.207272 | -0.096020 | -0.194010 | -0.033303 | -0.089915 | -0.068138 | -0.277780 | -0.049784 | 0.318376 | -0.174525 | -0.261462 | 0.016347 | 0.277293 | 0.092585 | -0.439281 | -0.437891 | 0.142651 | 0.466859 | -0.111111 | 0.116157 | -0.507400 | 1.000000 | 0.421247 | 0.158724 | 0.035946 | -0.104899 | -0.048791 | -0.015530 | -0.343337 | 0.100537 | -0.294896 | 0.155642 | -0.122773 | -0.312228 | 0.227197 | -0.091865 | 0.145478 | -0.158051 | -0.173739 | -0.093564 | -0.134727 |
| V23 | -0.436782 | -0.181389 | -0.785580 | 0.036908 | 0.456634 | -0.186560 | -0.050080 | 0.717858 | -0.344270 | -0.458001 | -0.111463 | 0.454646 | -0.055490 | 0.383793 | -0.179441 | 0.532444 | 0.535777 | 0.396764 | -0.428241 | 0.201459 | 0.381939 | 0.421247 | 1.000000 | 0.442574 | -0.546003 | -0.345841 | -0.628294 | 0.177058 | 0.102085 | 0.541924 | -0.166827 | 0.633804 | 0.051773 | -0.355795 | -0.275453 | -0.136416 | 0.571951 | -0.341177 | -0.478436 | 0.268313 | 0.071042 |
| V24 | -0.512832 | 0.221934 | -0.265330 | 0.516096 | 0.662638 | -0.198847 | -0.210543 | 0.250453 | -0.390001 | -0.001608 | -0.377024 | 0.036910 | -0.492305 | -0.149375 | -0.138497 | 0.102879 | 0.492579 | 0.169876 | 0.136032 | -0.069796 | 0.143596 | 0.158724 | 0.442574 | 1.000000 | -0.613548 | 0.156184 | -0.755335 | 0.408712 | -0.084531 | 0.321787 | 0.158342 | 0.825119 | 0.359401 | -0.220454 | 0.249089 | -0.210090 | -0.243950 | 0.168662 | -0.401241 | -0.197865 | -0.091242 |
| V25 | 0.675602 | -0.127100 | 0.595676 | -0.269900 | -0.602529 | -0.190472 | 0.201945 | -0.202921 | 0.314500 | 0.276199 | -0.217322 | -0.052367 | 0.160221 | 0.115315 | 0.333670 | -0.321164 | -0.169761 | -0.251902 | 0.173524 | 0.208989 | -0.321902 | 0.035946 | -0.546003 | -0.613548 | 1.000000 | -0.108421 | 0.766255 | -0.138697 | -0.469705 | -0.764734 | 0.145528 | -0.711082 | -0.735157 | 0.373514 | -0.039913 | 0.392944 | -0.029111 | -0.194906 | 0.370801 | 0.078735 | -0.001440 |
| V26 | 0.222721 | 0.787440 | 0.459761 | 0.106663 | 0.405462 | 0.147217 | 0.023205 | -0.502977 | 0.316018 | -0.223186 | -0.057829 | -0.297273 | 0.084519 | -0.674520 | -0.191763 | -0.426965 | -0.216592 | -0.064678 | -0.024596 | -0.623512 | -0.410914 | -0.104899 | -0.345841 | 0.156184 | -0.108421 | 1.000000 | -0.079879 | -0.453507 | -0.048861 | -0.296300 | 0.331466 | 0.011615 | 0.367059 | -0.460704 | 0.207572 | -0.149467 | -0.559232 | 0.376878 | 0.018593 | -0.002513 | -0.180469 |
| V27 | 0.683816 | -0.204437 | 0.499957 | -0.588490 | -0.662801 | 0.217310 | 0.289410 | -0.418680 | 0.173953 | 0.502179 | 0.252830 | -0.083374 | 0.301270 | 0.118412 | 0.314699 | -0.252718 | -0.706598 | -0.215565 | 0.095771 | -0.179680 | -0.258423 | -0.048791 | -0.628294 | -0.755335 | 0.766255 | -0.079879 | 1.000000 | -0.360284 | -0.222870 | -0.603725 | 0.185935 | -0.765733 | -0.380365 | 0.324211 | -0.029566 | 0.424834 | -0.148273 | 0.051612 | 0.542822 | 0.067686 | 0.014891 |
| V28 | -0.280558 | 0.032709 | -0.411772 | 0.660283 | -0.034228 | -0.182653 | 0.189329 | 0.172116 | -0.099764 | 0.033785 | -0.087538 | 0.043959 | -0.657255 | -0.054306 | 0.323758 | 0.122186 | 0.659308 | -0.325600 | 0.529055 | 0.413681 | 0.406565 | -0.015530 | 0.177058 | 0.408712 | -0.138697 | -0.453507 | -0.360284 | 1.000000 | -0.009382 | 0.133319 | -0.547992 | 0.131296 | -0.258229 | 0.561588 | -0.104315 | -0.480046 | 0.015925 | 0.298395 | -0.335742 | -0.326313 | 0.207359 |
| V29 | -0.062311 | 0.179821 | -0.445440 | -0.186060 | 0.093092 | 0.586885 | 0.311074 | 0.058335 | -0.235584 | -0.434624 | 0.811228 | 0.080279 | -0.339866 | -0.353208 | 0.280440 | 0.416053 | -0.059505 | -0.039814 | -0.213289 | -0.376122 | 0.453329 | -0.343337 | 0.102085 | -0.084531 | -0.469705 | -0.048861 | -0.222870 | -0.009382 | 1.000000 | 0.670054 | -0.220486 | -0.079502 | 0.596903 | -0.251588 | -0.155400 | -0.485512 | 0.147663 | 0.334013 | -0.427830 | 0.455882 | 0.108342 |
| V30 | -0.464534 | -0.216071 | -0.607322 | 0.121724 | 0.141055 | 0.153105 | -0.250387 | 0.360815 | -0.282570 | -0.314250 | 0.371119 | 0.137589 | -0.079904 | 0.044626 | -0.252483 | 0.359406 | 0.171504 | 0.235128 | -0.138990 | 0.065311 | 0.229215 | 0.100537 | 0.541924 | 0.321787 | -0.764734 | -0.296300 | -0.603725 | 0.133319 | 0.670054 | 1.000000 | -0.305334 | 0.506868 | 0.611668 | -0.280160 | 0.207840 | -0.405508 | 0.231476 | -0.090686 | -0.508283 | 0.012973 | 0.038867 |
| V31 | 0.084862 | -0.042449 | 0.463729 | -0.368177 | 0.301930 | -0.115764 | -0.034001 | 0.149730 | -0.458116 | 0.403991 | -0.231534 | 0.033878 | 0.024917 | 0.172368 | 0.138395 | 0.210411 | -0.363612 | 0.048011 | -0.219658 | -0.367301 | -0.021029 | -0.294896 | -0.166827 | 0.158342 | 0.145528 | 0.331466 | 0.185935 | -0.547992 | -0.220486 | -0.305334 | 1.000000 | 0.244383 | 0.144410 | -0.276676 | 0.160888 | 0.627929 | -0.319980 | -0.232866 | 0.197549 | 0.245345 | -0.136951 |
| V32 | -0.633279 | -0.115820 | -0.367431 | 0.383456 | 0.619779 | -0.292468 | -0.458844 | 0.471798 | -0.369241 | -0.018839 | -0.363715 | 0.008758 | -0.102412 | 0.276829 | -0.394717 | 0.289538 | 0.344394 | 0.287090 | -0.147915 | 0.096048 | 0.142997 | 0.155642 | 0.633804 | 0.825119 | -0.711082 | 0.011615 | -0.765733 | 0.131296 | -0.079502 | 0.506868 | 0.244383 | 1.000000 | 0.425631 | -0.368878 | 0.252875 | -0.047092 | -0.076385 | -0.273652 | -0.390348 | -0.207478 | -0.032793 |
| V33 | -0.289917 | 0.203387 | -0.219509 | -0.052216 | 0.458888 | 0.587371 | -0.267082 | -0.251002 | -0.137518 | -0.110081 | 0.422481 | -0.355644 | -0.111937 | -0.321509 | -0.287010 | 0.014966 | -0.307117 | 0.314099 | -0.286116 | -0.580161 | 0.054475 | -0.122773 | 0.051773 | 0.359401 | -0.735157 | 0.367059 | -0.380365 | -0.258229 | 0.596903 | 0.611668 | 0.144410 | 0.425631 | 1.000000 | -0.605510 | 0.239364 | -0.279235 | -0.304743 | 0.337791 | -0.247687 | 0.062731 | -0.102548 |
| V34 | -0.019433 | -0.281097 | 0.225753 | 0.297496 | -0.607112 | -0.401306 | 0.232444 | 0.054575 | -0.102147 | 0.514227 | -0.062297 | 0.237594 | -0.112675 | 0.142082 | 0.364587 | 0.040945 | 0.070968 | -0.602106 | 0.756188 | 0.503773 | 0.111245 | -0.312228 | -0.355795 | -0.220454 | 0.373514 | -0.460704 | 0.324211 | 0.561588 | -0.251588 | -0.280160 | -0.276676 | -0.368878 | -0.605510 | 1.000000 | 0.043479 | 0.090631 | -0.029484 | 0.053792 | 0.339965 | -0.490010 | 0.153854 |
| V35 | 0.142695 | -0.054777 | 0.501409 | 0.340764 | -0.341275 | -0.317068 | -0.438937 | -0.167235 | 0.117301 | 0.346129 | -0.169441 | -0.194636 | 0.224130 | -0.161742 | -0.337384 | -0.447736 | -0.213708 | -0.190625 | 0.553275 | 0.059147 | -0.700400 | 0.227197 | -0.275453 | 0.249089 | -0.039913 | 0.207572 | -0.029566 | -0.104315 | -0.155400 | 0.207840 | 0.160888 | 0.252875 | 0.239364 | 0.043479 | 1.000000 | -0.065047 | -0.623487 | -0.124098 | -0.096356 | -0.623920 | -0.145603 |
| V36 | -0.124949 | -0.580675 | 0.438341 | -0.557958 | -0.045510 | -0.247402 | -0.280033 | 0.155913 | -0.351657 | 0.561161 | -0.427396 | 0.242611 | 0.422558 | 0.547594 | -0.133556 | 0.080209 | -0.374603 | 0.356522 | -0.240156 | 0.143676 | -0.102281 | -0.091865 | -0.136416 | -0.210090 | 0.392944 | -0.149467 | 0.424834 | -0.480046 | -0.485512 | -0.405508 | 0.627929 | -0.047092 | -0.279235 | 0.090631 | -0.065047 | 1.000000 | 0.237905 | -0.485314 | 0.751734 | 0.100848 | -0.216453 |
| V37 | -0.350610 | -0.437349 | -0.502482 | -0.356650 | 0.064515 | -0.067280 | -0.007940 | 0.522797 | -0.200198 | -0.403067 | -0.045318 | 0.548449 | 0.168500 | 0.422413 | -0.084773 | 0.430631 | 0.348304 | 0.479581 | -0.505795 | 0.426708 | 0.386868 | 0.145478 | 0.571951 | -0.243950 | -0.029111 | -0.559232 | -0.148273 | 0.015925 | 0.147663 | 0.231476 | -0.319980 | -0.076385 | -0.304743 | -0.029484 | -0.623487 | 0.237905 | 1.000000 | -0.407308 | 0.119262 | 0.472608 | -0.004769 |
| V38 | 0.148316 | 0.655368 | -0.073865 | 0.090986 | 0.171836 | 0.628722 | 0.469057 | -0.614964 | 0.038099 | -0.007520 | 0.471418 | -0.206345 | -0.559274 | -0.762684 | 0.347191 | -0.235650 | -0.131721 | -0.189250 | 0.266142 | -0.646999 | 0.155868 | -0.158051 | -0.341177 | 0.168662 | -0.194906 | 0.376878 | 0.051612 | 0.298395 | 0.334013 | -0.090686 | -0.232866 | -0.273652 | 0.337791 | 0.053792 | -0.124098 | -0.485314 | -0.407308 | 1.000000 | -0.048431 | 0.024597 | 0.003584 |
| V39 | -0.120898 | -0.350539 | 0.527742 | -0.389080 | -0.217778 | -0.025458 | -0.236700 | -0.344376 | 0.001001 | 0.560471 | -0.239641 | 0.053823 | 0.486516 | 0.184663 | -0.238364 | -0.287419 | -0.511091 | 0.225480 | 0.031867 | 0.079338 | -0.242004 | -0.173739 | -0.478436 | -0.401241 | 0.370801 | 0.018593 | 0.542822 | -0.335742 | -0.427830 | -0.508283 | 0.197549 | -0.390348 | -0.247687 | 0.339965 | -0.096356 | 0.751734 | 0.119262 | -0.048431 | 1.000000 | -0.191764 | -0.227264 |
| V40 | 0.212632 | 0.155617 | -0.306190 | -0.665310 | 0.335332 | 0.423882 | 0.479354 | 0.256984 | -0.308607 | -0.474803 | 0.336767 | 0.307470 | -0.258869 | -0.053814 | 0.438196 | 0.477094 | 0.038264 | 0.286388 | -0.699379 | -0.412649 | 0.470061 | -0.093564 | 0.268313 | -0.197865 | 0.078735 | -0.002513 | 0.067686 | -0.326313 | 0.455882 | 0.012973 | 0.245345 | -0.207478 | 0.062731 | -0.490010 | -0.623920 | 0.100848 | 0.472608 | 0.024597 | -0.191764 | 1.000000 | 0.007802 |
| Target | 0.073307 | -0.000946 | -0.213855 | 0.110786 | -0.100525 | 0.000237 | 0.236907 | 0.135996 | 0.008124 | -0.051263 | 0.196715 | -0.021807 | -0.139718 | 0.117586 | 0.249118 | 0.230507 | 0.085314 | -0.293340 | 0.053897 | 0.070803 | 0.256411 | -0.134727 | 0.071042 | -0.091242 | -0.001440 | -0.180469 | 0.014891 | 0.207359 | 0.108342 | 0.038867 | -0.136951 | -0.032793 | -0.102548 | 0.153854 | -0.145603 | -0.216453 | -0.004769 | 0.003584 | -0.227264 | 0.007802 | 1.000000 |
Strong Correlations:
Variable1 Variable2 Correlation
3 V1 V5 True
22 V1 V24 True
23 V1 V25 True
25 V1 V27 True
30 V1 V32 True
51 V2 V14 True
57 V2 V20 True
63 V2 V26 True
73 V2 V36 True
75 V2 V38 True
91 V3 V16 True
96 V3 V21 True
98 V3 V23 True
100 V3 V25 True
105 V3 V30 True
110 V3 V35 True
112 V3 V37 True
114 V3 V39 True
129 V4 V17 True
131 V4 V19 True
136 V4 V24 True
139 V4 V27 True
140 V4 V28 True
148 V4 V36 True
152 V4 V40 True
167 V5 V19 True
172 V5 V24 True
173 V5 V25 True
175 V5 V27 True
180 V5 V32 True
182 V5 V34 True
191 V6 V8 True
194 V6 V11 True
203 V6 V20 True
212 V6 V29 True
216 V6 V33 True
221 V6 V38 True
228 V7 V11 True
232 V7 V15 True
235 V7 V18 True
259 V8 V9 True
262 V8 V12 True
264 V8 V14 True
266 V8 V16 True
267 V8 V17 True
273 V8 V23 True
276 V8 V26 True
287 V8 V37 True
288 V8 V38 True
294 V9 V12 True
298 V9 V16 True
303 V9 V21 True
330 V10 V17 True
340 V10 V27 True
347 V10 V34 True
349 V10 V36 True
352 V10 V39 True
363 V11 V20 True
372 V11 V29 True
388 V12 V16 True
409 V12 V37 True
415 V13 V15 True
421 V13 V21 True
428 V13 V28 True
438 V13 V38 True
447 V14 V20 True
453 V14 V26 True
463 V14 V36 True
465 V14 V38 True
471 V15 V18 True
474 V15 V21 True
499 V16 V21 True
501 V16 V23 True
522 V17 V20 True
525 V17 V23 True
529 V17 V27 True
530 V17 V28 True
541 V17 V39 True
544 V18 V19 True
559 V18 V34 True
575 V19 V28 True
581 V19 V34 True
582 V19 V35 True
584 V19 V37 True
587 V19 V40 True
594 V20 V26 True
601 V20 V33 True
602 V20 V34 True
606 V20 V38 True
610 V21 V22 True
623 V21 V35 True
650 V23 V25 True
652 V23 V27 True
655 V23 V30 True
657 V23 V32 True
662 V23 V37 True
667 V24 V25 True
669 V24 V27 True
674 V24 V32 True
685 V25 V27 True
688 V25 V30 True
690 V25 V32 True
691 V25 V33 True
710 V26 V37 True
717 V27 V30 True
719 V27 V32 True
726 V27 V39 True
731 V28 V31 True
734 V28 V34 True
742 V29 V30 True
745 V29 V33 True
755 V30 V32 True
756 V30 V33 True
762 V30 V39 True
769 V31 V36 True
784 V33 V34 True
800 V35 V37 True
803 V35 V40 True
807 V36 V39 True
- None of the 40 input variables has a strong positive or negative correlation with the Target variable on its own; any relationship with the output appears to come from complex interactions among the inputs.
- The pairs of input variables with strong correlations (positive or negative) are listed above.
- Since no individual variable shows a clear or strong correlation with the output variable, this concludes the bivariate analysis; going further would add little additional insight.
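The boolean filter above tells us which pairs cross the threshold but discards the correlation magnitudes. A small variant (a sketch, not part of the original notebook) keeps the signed values and sorts the pairs by strength, which makes it easier to see which relationships are strongest:

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.5):
    """Return pairs of columns whose absolute correlation exceeds `threshold`,
    with the signed correlation value, sorted strongest first."""
    corr = df.corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().reset_index()  # stack() silently drops the masked NaNs
    pairs.columns = ["Variable1", "Variable2", "Correlation"]
    pairs = pairs[pairs["Correlation"].abs() > threshold]
    # Sort by absolute correlation, descending
    order = pairs["Correlation"].abs().sort_values(ascending=False).index
    return pairs.reindex(order)
```

Called as `strong_pairs(train_data_copy.drop(columns=["Target"]))`, this would list the same pairs as above together with how strongly, and in which direction, each pair moves together.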
Data Preprocessing¶
Train Validation Split¶
# defining the dependent and independent variables
X = train_data_copy.drop(["Target"], axis=1)
y = train_data_copy["Target"]
# Splitting the data into training and validation sets. We need to use stratify to maintain the same distribution of the target variable in both sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=VS, random_state=RS, stratify=y, shuffle=True)
X_test = test_data_copy.drop(["Target"], axis=1)
y_test = test_data_copy["Target"]
# Check that the percentage of the output variable is the same in both train and validation datasets
train_percentage = y_train.value_counts(normalize=True)
val_percentage = y_val.value_counts(normalize=True)
print("Train Data Output Percentage:\n\n", train_percentage)
print("Validation Data Output Percentage:\n", val_percentage)
# Print the shape of the training and validation sets
print("Training Data Shape:", X_train.shape, y_train.shape)
print("Validation Data Shape:", X_val.shape, y_val.shape)
Train Data Output Percentage:

Target
0    0.944467
1    0.055533
Name: proportion, dtype: float64

Validation Data Output Percentage:
Target
0    0.9446
1    0.0554
Name: proportion, dtype: float64

Training Data Shape: (15000, 40) (15000,)
Validation Data Shape: (5000, 40) (5000,)
Initial Data Set Treatment¶
#Convert the Target variable to float64 on the training, validation, and test sets for consistency
y_train = y_train.astype(float)
y_val = y_val.astype(float)
y_test = y_test.astype(float)
# Impute the missing values in the training, validation, and test sets using the median of each column.
# The median is used to avoid the influence of outliers. SimpleImputer handles missing values systematically across multiple columns.
# To avoid data leakage, the imputer is fitted on the training set only; the medians learned there are then applied to the validation and test sets.
imputer = SimpleImputer(strategy="median")
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_val.columns)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
#Confirm that there are no missing values in the training, validation, and test sets
print("Missing Values in Training Set:", X_train.isnull().sum().sum())
print("Missing Values in Validation Set:", X_val.isnull().sum().sum())
print("Missing Values in Test Set:", X_test.isnull().sum().sum())
Missing Values in Training Set: 0
Missing Values in Validation Set: 0
Missing Values in Test Set: 0
Model Building¶
Model Evaluation Metrics and Plotting Functions¶
- To reduce code duplication, I'm predefining methods that will be used frequently in the modeling steps. The methods cover:
- Performance metric calculations
- Model performance plots
# Method to calculate the difference in performance metrics between corresponding training and validation results
# Parameters:
# model_training_metrics_list (DataFrame): Concatenated training metrics, one row per model.
# model_validation_metrics_list (DataFrame): Concatenated validation metrics, one row per model.
# Returns:
# A DataFrame containing the absolute difference in performance metrics for each model.
def performance_metrics_difference(model_training_metrics_list, model_validation_metrics_list):
#Initialize empty DataFrame to store the difference in performance metrics
all_differences_df = pd.DataFrame()
#Loop through model_training_metrics_list and model_validation_metrics_list indices and calculate the difference in performance metrics
for index in range(len(model_training_metrics_list)):
model_training_metrics_df = model_training_metrics_list.iloc[index]
model_validation_metrics_df = model_validation_metrics_list.iloc[index]
# Calculate the absolute difference in performance metrics
difference_df = pd.DataFrame({
'Model': "Model " + str(index),
'Loss Difference': abs(model_training_metrics_df['Loss'] - model_validation_metrics_df['Loss']),
'F1 Score Difference': abs(model_training_metrics_df['F1 Score'] - model_validation_metrics_df['F1 Score']),
'Accuracy Score Difference': abs(model_training_metrics_df['Accuracy Score'] - model_validation_metrics_df['Accuracy Score']),
'Recall Score Difference': abs(model_training_metrics_df['Recall Score'] - model_validation_metrics_df['Recall Score']),
'Precision Score Difference': abs(model_training_metrics_df['Precision Score'] - model_validation_metrics_df['Precision Score'])
}, index=[0])
# Append the difference DataFrame to the all_differences_df
all_differences_df = pd.concat([all_differences_df, difference_df], ignore_index=True)
# Return the DataFrame containing the differences in performance metrics
return all_differences_df
def plot_loss_accuracy(history, name):
"""
Function to plot loss/accuracy
history: an object which stores the metrics and losses.
name: can be one of Loss or Accuracy
"""
fig, ax = plt.subplots() #Creating a subplot with figure and axes.
plt.plot(history.history[name]) #Plotting the train accuracy or train loss
plt.plot(history.history['val_'+name]) #Plotting the validation accuracy or validation loss
plt.title('Model ' + name.capitalize()) #Defining the title of the plot.
plt.ylabel(name.capitalize()) #Capitalizing the first letter.
plt.xlabel('Epoch') #Defining the label for the x-axis.
fig.legend(['Train', 'Validation'], loc="outside right upper") #Defining the legend, loc controls the position of the legend.
from sklearn.utils import class_weight
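The `CLASS_WEIGHTS` constant used when fitting the networks is assumed to be defined earlier in the notebook. As a hedged sketch of how such weights can be derived (the toy label array `y_example` below is illustrative, not the project data), sklearn's `compute_class_weight` with the `"balanced"` strategy up-weights the rare failure class:

```python
import numpy as np
from sklearn.utils import class_weight

# Toy imbalanced labels standing in for y_train (~5.5% positives, as in this dataset)
y_example = np.array([0] * 944 + [1] * 56)

# "balanced" assigns each class a weight of n_samples / (n_classes * class_count),
# so the rare failure class contributes proportionally more to the loss
weights = class_weight.compute_class_weight(
    class_weight="balanced", classes=np.unique(y_example), y=y_example
)
CLASS_WEIGHTS = dict(zip(np.unique(y_example), weights))
```

Passing this dictionary as `class_weight` to `model.fit` makes a missed failure cost roughly 17x more than a missed healthy turbine during training.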
# Function to calculate performance metrics of a model
# Parameters:
# model: The trained model to evaluate.
# X: Features used for prediction.
# y: True labels for the features.
# model_name: Name of the model for identification in the output DataFrame.
# Returns:
# A DataFrame containing the performance metrics of the model, including F1 score, accuracy, recall, and precision.
def performance_metrics(model, X, y, model_name="Default"):
y_pred = (model.predict(X) > THRESHOLD).astype(int).ravel()
f1 = f1_score(y, y_pred)
accuracy = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred)
precision = precision_score(y, y_pred)
metrics_df = pd.DataFrame({
'Model': [model_name],
'F1 Score': [f1],
'Accuracy Score': [accuracy],
'Recall Score': [recall],
'Precision Score': [precision]
})
return metrics_df
def get_binary_prediction_value(y_pred):
return (y_pred > THRESHOLD).astype(int).flatten()
def evaluate_neural_network_on_recall(
model,
x,
y,
validation_data,
epochs=EPOCHS,
batch_size=BATCH_SIZE,
model_name="Default",
plot_loss_graph=False
):
# Note: the model is always fit on the global training data (X_train, y_train) with the given epochs, batch size, and validation data; the x/y arguments are used only for the evaluation below, so each call to this function continues training the passed-in model.
history = model.fit(X_train,
y_train,
epochs=epochs,
batch_size = batch_size,
validation_data=validation_data,
class_weight = CLASS_WEIGHTS,
verbose=0)
if plot_loss_graph:
plot_loss_accuracy(history, 'loss')
y_pred = get_binary_prediction_value(model.predict(x))
# Calculate the loss of the model (the Recall metric returned by evaluate is discarded; recall is recomputed below at THRESHOLD)
loss, _ = model.evaluate(x, y, verbose=0)
f1 = f1_score(y, y_pred)
accuracy = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred)
precision = precision_score(y, y_pred)
# To make things cleaner create a DataFrame out of the classification report as a dictionary
report = classification_report(y, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()
# Add a caption/title using Styler
styled_report_df = report_df.style.set_caption(model_name + " Classification Report")
metrics_df = pd.DataFrame({
'Model': [model_name],
'Loss': [loss],
'F1 Score': [f1],
'Accuracy Score': [accuracy],
'Recall Score': [recall],
'Precision Score': [precision]
})
#Return the metrics and report DataFrames
return metrics_df, styled_report_df
def generate_model_reports(model, model_name="Default"):
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
model_training_metrics_df, model_training_report_df = evaluate_neural_network_on_recall(
model = model,
x = X_train,
y = y_train,
validation_data=(X_val, y_val),
model_name=model_name + " Training",
plot_loss_graph=True
)
model_validation_metrics_df, model_validation_report_df = evaluate_neural_network_on_recall(
model = model,
x = X_val,
y = y_val,
validation_data=(X_val, y_val),
model_name= model_name + " Validation",
)
model.summary()
display(model_training_metrics_df)
display(model_validation_metrics_df)
display(model_training_report_df)
display(model_validation_report_df)
return model_training_metrics_df, model_validation_metrics_df, model_training_report_df, model_validation_report_df
Model Evaluation Criterion¶
- Since the largest expense to the company comes from failures the model fails to detect (false negatives), which lead to costly replacements rather than cheaper preventive repairs, the model should focus on maximizing detection of actual failures (true positives). As a result, the performance metric the models will be tuned for is recall.
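As a toy numeric illustration of this criterion (the numbers below are made up, not from the ReneWind data): a model that misses real failures scores much worse on recall than one that merely raises extra false alarms, even if its precision looks better.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# 100 turbines, 10 true failures
y_true = np.array([1] * 10 + [0] * 90)

# Model A: conservative, misses 4 real failures (4 costly replacements)
y_pred_a = np.array([1] * 6 + [0] * 4 + [0] * 90)
# Model B: aggressive, catches 9 failures but flags 10 healthy turbines
y_pred_b = np.array([1] * 9 + [0] * 1 + [1] * 10 + [0] * 80)

print(recall_score(y_true, y_pred_a), precision_score(y_true, y_pred_a))  # 0.6, 1.0
print(recall_score(y_true, y_pred_b), precision_score(y_true, y_pred_b))  # 0.9, ~0.47
```

Model B is the better choice under this cost structure despite its lower precision, which is why recall drives model selection here.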
Initial Model Building (Model 0)¶
- Let's start with a neural network consisting of
- just one hidden layer
- activation function of ReLU
- SGD as the optimizer
# Define the optimizer to be used for training the model
# Using SGD (Stochastic Gradient Descent) optimizer with default parameters
# Note: You can also use other optimizers like Adam, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.SGD()
model0 = Sequential()
model0.add(Dense(20, activation='relu', input_dim=X_train.shape[1]))
# Sigmoid output layer for binary classification
model0.add(Dense(1, activation='sigmoid'))
model0.compile(optimizer=optimizer,
loss='binary_crossentropy',
metrics=['Recall'])
# Generate the model reports
model0_training_metrics_df, model0_validation_metrics_df, model0_training_report_df, model0_validation_report_df = generate_model_reports(
model=model0,
model_name="Model 0"
)
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| dense (Dense) | (None, 20) | 820 |
| dense_1 (Dense) | (None, 1) | 21 |
Total params: 843 (3.30 KB)
Trainable params: 841 (3.29 KB)
Non-trainable params: 0 (0.00 B)
Optimizer params: 2 (12.00 B)
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 0 Training | 0.0908 | 0.846585 | 0.981733 | 0.907563 | 0.793284 |
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 0 Validation | 0.111384 | 0.787402 | 0.973 | 0.902527 | 0.698324 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.994518 | 0.986094 | 0.990289 | 14167.000000 |
| 1.0 | 0.793284 | 0.907563 | 0.846585 | 833.000000 |
| accuracy | 0.981733 | 0.981733 | 0.981733 | 0.981733 |
| macro avg | 0.893901 | 0.946829 | 0.918437 | 15000.000000 |
| weighted avg | 0.983343 | 0.981733 | 0.982308 | 15000.000000 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.994184 | 0.977133 | 0.985585 | 4723.000000 |
| 1.0 | 0.698324 | 0.902527 | 0.787402 | 277.000000 |
| accuracy | 0.973000 | 0.973000 | 0.973000 | 0.973000 |
| macro avg | 0.846254 | 0.939830 | 0.886493 | 5000.000000 |
| weighted avg | 0.977793 | 0.973000 | 0.974605 | 5000.000000 |
- On the initial model setup we achieve a recall score of 0.908 and 0.903 on the training and validation sets respectively, which is a good start.
- The model is not overfit at this point.
Model Performance Improvement¶
Model 1¶
# Define the optimizer to be used for training the model
# Using SGD (Stochastic Gradient Descent) optimizer with default parameters
# Note: You can also use other optimizers like Adam, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.SGD()
# Define a more complex model with additional layers
# This model has two hidden layers with 40 and 20 neurons respectively, using ReLU activation functions.
# The output layer uses a sigmoid activation function for binary classification.
model1 = Sequential()
model1.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model1.add(Dense(20, activation='relu'))
# Sigmoid output layer for binary classification
model1.add(Dense(1, activation='sigmoid'))
model1.compile(optimizer=optimizer,
loss='binary_crossentropy',
metrics=['Recall'])
# Generate the model reports
model1_training_metrics_df, model1_validation_metrics_df, model1_training_report_df, model1_validation_report_df = generate_model_reports(
model=model1,
model_name="Model 1"
)
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| dense (Dense) | (None, 40) | 1,640 |
| dense_1 (Dense) | (None, 20) | 820 |
| dense_2 (Dense) | (None, 1) | 21 |
Total params: 2,483 (9.70 KB)
Trainable params: 2,481 (9.69 KB)
Non-trainable params: 0 (0.00 B)
Optimizer params: 2 (12.00 B)
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 1 Training | 0.038931 | 0.936937 | 0.993 | 0.936375 | 0.9375 |
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 1 Validation | 0.119307 | 0.797428 | 0.9748 | 0.895307 | 0.718841 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.996259 | 0.996329 | 0.996294 | 14167.000000 |
| 1.0 | 0.937500 | 0.936375 | 0.936937 | 833.000000 |
| accuracy | 0.993000 | 0.993000 | 0.993000 | 0.993000 |
| macro avg | 0.966880 | 0.966352 | 0.966616 | 15000.000000 |
| weighted avg | 0.992996 | 0.993000 | 0.992998 | 15000.000000 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.993770 | 0.979462 | 0.986564 | 4723.000000 |
| 1.0 | 0.718841 | 0.895307 | 0.797428 | 277.000000 |
| accuracy | 0.974800 | 0.974800 | 0.974800 | 0.974800 |
| macro avg | 0.856305 | 0.937385 | 0.891996 | 5000.000000 |
| weighted avg | 0.978539 | 0.974800 | 0.976086 | 5000.000000 |
- The model achieves a recall score of 0.936 and 0.895 on the training and validation sets respectively: the additional hidden layer improves recall on the training set but slightly lowers it on the validation set.
- The model is not overfit at this point.
- The loss values converge at around 13 and 15 epochs.
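Since the loss curves converge within the first 13-15 epochs, training could be stopped automatically instead of by inspecting the plots. Keras provides `tf.keras.callbacks.EarlyStopping` for this; the underlying stopping rule is simple enough to sketch in plain Python (the helper below is illustrative, not part of the notebook):

```python
def should_stop(val_losses, patience=3):
    """Stop once validation loss has failed to improve for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return all(loss >= best_so_far for loss in val_losses[-patience:])

# Loss still improving -> keep training; loss plateaued -> stop
print(should_stop([0.50, 0.40, 0.30, 0.29, 0.28]))  # False
print(should_stop([0.50, 0.40, 0.41, 0.42, 0.43]))  # True
```

With `EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)` passed to `model.fit(..., callbacks=[...])`, Keras would apply the same idea and also roll back to the best epoch's weights.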
Model 2¶
# Define the optimizer to be used for training the model
# Using SGD (Stochastic Gradient Descent) optimizer with default parameters
# Note: You can also use other optimizers like Adam, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.SGD()
# Define a more complex model with additional layers
# This model has three hidden layers with 40, 20, and 10 neurons respectively, using ReLU activation functions.
# The output layer uses a sigmoid activation function for binary classification.
model2 = Sequential()
model2.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model2.add(Dense(20, activation='relu'))
model2.add(Dense(10, activation='relu'))
# Sigmoid output layer for binary classification
model2.add(Dense(1, activation='sigmoid'))
model2.compile(optimizer=optimizer,
loss='binary_crossentropy',
metrics=['Recall'])
# Generate the model reports
model2_training_metrics_df, model2_validation_metrics_df, model2_training_report_df, model2_validation_report_df = generate_model_reports(
model=model2,
model_name="Model 2"
)
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| dense (Dense) | (None, 40) | 1,640 |
| dense_1 (Dense) | (None, 20) | 820 |
| dense_2 (Dense) | (None, 10) | 210 |
| dense_3 (Dense) | (None, 1) | 11 |
Total params: 2,683 (10.48 KB)
Trainable params: 2,681 (10.47 KB)
Non-trainable params: 0 (0.00 B)
Optimizer params: 2 (12.00 B)
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 2 Training | 0.069391 | 0.882751 | 0.986133 | 0.939976 | 0.832094 |
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 2 Validation | 0.090182 | 0.847059 | 0.9818 | 0.909747 | 0.792453 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.996444 | 0.988847 | 0.992631 | 14167.000000 |
| 1.0 | 0.832094 | 0.939976 | 0.882751 | 833.000000 |
| accuracy | 0.986133 | 0.986133 | 0.986133 | 0.986133 |
| macro avg | 0.914269 | 0.964412 | 0.937691 | 15000.000000 |
| weighted avg | 0.987317 | 0.986133 | 0.986529 | 15000.000000 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.994660 | 0.986026 | 0.990324 | 4723.000000 |
| 1.0 | 0.792453 | 0.909747 | 0.847059 | 277.000000 |
| accuracy | 0.981800 | 0.981800 | 0.981800 | 0.981800 |
| macro avg | 0.893557 | 0.947887 | 0.918692 | 5000.000000 |
| weighted avg | 0.983458 | 0.981800 | 0.982387 | 5000.000000 |
- The model achieves a recall score of 0.940 and 0.910 on the training and validation sets respectively: the training set performs about the same as the previous model while the validation set improves slightly.
- The model is not overfit at this point, and the training and validation recall scores are close to each other; the added complexity of a third hidden layer brings only a marginal gain.
Model 3¶
# Define the optimizer to be used for training the model
# Using Adam optimizer with a specified learning rate
# Note: You can also use other optimizers like SGD, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
# Define a more complex model with additional layers
# This model has three hidden layers with 40, 20, and 10 neurons respectively, using ReLU activation functions.
# The output layer uses a sigmoid activation function for binary classification.
model3 = Sequential()
model3.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model3.add(Dense(20, activation='relu'))
model3.add(Dense(10, activation='relu'))
# Sigmoid output layer for binary classification
model3.add(Dense(1, activation='sigmoid'))
model3.compile(optimizer=optimizer,
loss='binary_crossentropy',
metrics=['Recall'])
# Generate the model reports
model3_training_metrics_df, model3_validation_metrics_df, model3_training_report_df, model3_validation_report_df = generate_model_reports(
model=model3,
model_name="Model 3"
)
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| dense (Dense) | (None, 40) | 1,640 |
| dense_1 (Dense) | (None, 20) | 820 |
| dense_2 (Dense) | (None, 10) | 210 |
| dense_3 (Dense) | (None, 1) | 11 |
Total params: 8,045 (31.43 KB)
Trainable params: 2,681 (10.47 KB)
Non-trainable params: 0 (0.00 B)
Optimizer params: 5,364 (20.96 KB)
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 3 Training | 0.048698 | 0.934211 | 0.992667 | 0.937575 | 0.93087 |
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 3 Validation | 0.107645 | 0.856164 | 0.9832 | 0.902527 | 0.814332 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.996328 | 0.995906 | 0.996117 | 14167.000000 |
| 1.0 | 0.930870 | 0.937575 | 0.934211 | 833.000000 |
| accuracy | 0.992667 | 0.992667 | 0.992667 | 0.992667 |
| macro avg | 0.963599 | 0.966741 | 0.965164 | 15000.000000 |
| weighted avg | 0.992693 | 0.992667 | 0.992679 | 15000.000000 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.994247 | 0.987931 | 0.991079 | 4723.000000 |
| 1.0 | 0.814332 | 0.902527 | 0.856164 | 277.000000 |
| accuracy | 0.983200 | 0.983200 | 0.983200 | 0.983200 |
| macro avg | 0.904289 | 0.945229 | 0.923622 | 5000.000000 |
| weighted avg | 0.984279 | 0.983200 | 0.983605 | 5000.000000 |
- The model achieves a recall score of 0.938 and 0.903 on the training and validation sets respectively, roughly on par with the previous model on recall.
- Switching the optimizer from SGD to Adam leaves recall about the same but noticeably improves precision and F1 on both sets.
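The only architectural change in Model 3 is the optimizer. The update rule behind `keras.optimizers.Adam` can be sketched for a single parameter in numpy (a simplified sketch using Keras's default hyperparameters; the function and variable names are illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update: running moment estimates give per-parameter step sizes."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# On the first step the bias-corrected update is close to a plain SGD step of size lr,
# but over time each parameter's effective step adapts to its gradient history.
w, m, v = adam_step(w=0.0, grad=1.0, m=0.0, v=0.0, t=1)
```

This per-parameter adaptivity is a plausible reason Adam converges to better precision than plain SGD within the same 20-epoch budget.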
Model 4¶
# Define the optimizer to be used for training the model
# Using Adam optimizer with a specified learning rate
# Note: You can also use other optimizers like SGD, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
# Define a more complex model with additional layers
# This model has three hidden layers with 40, 20, and 10 neurons respectively, using ReLU activation functions.
# The second layer has a batch normalization layer added to improve training stability and performance.
# The output layer uses a sigmoid activation function for binary classification.
model4 = Sequential()
model4.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model4.add(Dense(20, activation='relu'))
#Add batch normalization layer after the 2nd layer
model4.add(BatchNormalization())
model4.add(Dense(10, activation='relu'))
# Sigmoid output layer for binary classification
model4.add(Dense(1, activation='sigmoid'))
model4.compile(optimizer=optimizer,
loss='binary_crossentropy',
metrics=['Recall'])
# Generate the model reports
model4_training_metrics_df, model4_validation_metrics_df, model4_training_report_df, model4_validation_report_df = generate_model_reports(
model=model4,
model_name="Model 4"
)
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| dense (Dense) | (None, 40) | 1,640 |
| dense_1 (Dense) | (None, 20) | 820 |
| batch_normalization (BatchNormalization) | (None, 20) | 80 |
| dense_2 (Dense) | (None, 10) | 210 |
| dense_3 (Dense) | (None, 1) | 11 |
Total params: 8,205 (32.05 KB)
Trainable params: 2,721 (10.63 KB)
Non-trainable params: 40 (160.00 B)
Optimizer params: 5,444 (21.27 KB)
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 4 Training | 0.050145 | 0.925391 | 0.991733 | 0.923169 | 0.927624 |
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 4 Validation | 0.075911 | 0.870466 | 0.985 | 0.909747 | 0.834437 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.995484 | 0.995765 | 0.995624 | 14167.000000 |
| 1.0 | 0.927624 | 0.923169 | 0.925391 | 833.000000 |
| accuracy | 0.991733 | 0.991733 | 0.991733 | 0.991733 |
| macro avg | 0.961554 | 0.959467 | 0.960508 | 15000.000000 |
| weighted avg | 0.991715 | 0.991733 | 0.991724 | 15000.000000 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.994679 | 0.989414 | 0.992039 | 4723.000000 |
| 1.0 | 0.834437 | 0.909747 | 0.870466 | 277.000000 |
| accuracy | 0.985000 | 0.985000 | 0.985000 | 0.985000 |
| macro avg | 0.914558 | 0.949580 | 0.931253 | 5000.000000 |
| weighted avg | 0.985801 | 0.985000 | 0.985304 | 5000.000000 |
- The model achieves a recall score of 0.923 and 0.910 on the training and validation sets respectively: slightly lower on the training set and slightly higher on the validation set than the previous model.
- Adding Batch Normalization after the second hidden layer narrows the train/validation gap but does not improve overall performance much.
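What the added BatchNormalization layer computes can be sketched in numpy (training-time behavior only; `gamma` and `beta` are the layer's learned scale and shift, left here at their initial values, and the function name is illustrative):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-3):
    """Standardize each feature over the batch, then apply learned scale/shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Two features on very different scales end up comparably distributed
batch = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]])
normalized = batch_norm(batch)  # each column now has (near) zero mean
```

Keeping intermediate activations on a common scale this way is what stabilizes training; at inference Keras substitutes running averages for the batch statistics.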
Model 5¶
# Define the optimizer to be used for training the model
# Using Adam optimizer with a specified learning rate
# Note: You can also use other optimizers like SGD, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
# Define a more complex model with additional layers
# This model has three hidden layers with 40, 20, and 10 neurons respectively, using ReLU activation functions.
# The second layer has a dropout layer added to prevent overfitting.
# The third layer has a batch normalization layer added to improve training stability and performance.
# The output layer uses a sigmoid activation function for binary classification.
# Note: The dropout rate is set to 0.3, which means 30% of the neurons will be randomly dropped during training.
model5 = Sequential()
model5.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model5.add(Dense(20, activation='relu'))
# Add dropout layer after the 2nd layer to prevent overfitting
model5.add(Dropout(0.3))
model5.add(Dense(10, activation='relu'))
#Add batch normalization layer after the 3rd layer
model5.add(BatchNormalization())
# Sigmoid output layer for binary classification
model5.add(Dense(1, activation='sigmoid'))
model5.compile(optimizer=optimizer,
loss='binary_crossentropy',
metrics=['Recall'])
# Generate the model reports
model5_training_metrics_df, model5_validation_metrics_df, model5_training_report_df, model5_validation_report_df = generate_model_reports(
model=model5,
model_name="Model 5"
)
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| dense (Dense) | (None, 40) | 1,640 |
| dense_1 (Dense) | (None, 20) | 820 |
| dropout (Dropout) | (None, 20) | 0 |
| dense_2 (Dense) | (None, 10) | 210 |
| batch_normalization (BatchNormalization) | (None, 10) | 40 |
| dense_3 (Dense) | (None, 1) | 11 |
Total params: 8,125 (31.74 KB)
Trainable params: 2,701 (10.55 KB)
Non-trainable params: 20 (80.00 B)
Optimizer params: 5,404 (21.11 KB)
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 5 Training | 0.068797 | 0.91432 | 0.990467 | 0.915966 | 0.912679 |
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 5 Validation | 0.098513 | 0.918033 | 0.991 | 0.909747 | 0.926471 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.995058 | 0.994847 | 0.994953 | 14167.000000 |
| 1.0 | 0.912679 | 0.915966 | 0.914320 | 833.000000 |
| accuracy | 0.990467 | 0.990467 | 0.990467 | 0.990467 |
| macro avg | 0.953869 | 0.955407 | 0.954636 | 15000.000000 |
| weighted avg | 0.990483 | 0.990467 | 0.990475 | 15000.000000 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.994712 | 0.995765 | 0.995239 | 4723.000000 |
| 1.0 | 0.926471 | 0.909747 | 0.918033 | 277.000000 |
| accuracy | 0.991000 | 0.991000 | 0.991000 | 0.991000 |
| macro avg | 0.960591 | 0.952756 | 0.956636 | 5000.000000 |
| weighted avg | 0.990932 | 0.991000 | 0.990961 | 5000.000000 |
- The model achieves a recall score of 0.916 and 0.910 on the training and validation sets respectively; both are about the same as the previous model, and the train/validation gap is now minimal.
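The Dropout(0.3) layer works as sketched below (inverted dropout, the variant Keras applies at training time; at inference the layer is a no-op; the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(x, rate=0.3):
    """Zero each unit with probability `rate`; scale survivors by 1/(1-rate)
    so the expected activation is unchanged (inverted dropout)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

out = dropout(np.ones(10_000))
# Roughly 30% of units are zeroed, but the mean activation stays near 1.0
```

Randomly silencing units like this forces the network not to rely on any single neuron, which is why it acts as a regularizer.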
Model 6¶
# Define the optimizer to be used for training the model
# Using Adam optimizer with a specified learning rate
# Note: You can also use other optimizers like SGD, RMSprop, etc. based on your requirements.
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
# Define a more complex model with additional layers
# This model has four hidden layers with 40, 20, 10, and 5 neurons respectively, using ReLU activation functions.
# The second and fourth layers have dropout layers added to prevent overfitting.
# The third layer has a batch normalization layer added to improve training stability and performance.
# The output layer uses a sigmoid activation function for binary classification.
# Note: The dropout rate is set to 0.3, which means 30% of the neurons will be randomly dropped during training.
model6 = Sequential()
model6.add(Dense(40, activation='relu', input_dim=X_train.shape[1]))
model6.add(Dense(20, activation='relu'))
# Add dropout layer after the 2nd layer to prevent overfitting
model6.add(Dropout(0.3))
model6.add(Dense(10, activation='relu'))
#Add batch normalization layer after the 3rd layer
model6.add(BatchNormalization())
model6.add(Dense(5, activation='relu'))
# Add dropout layer after the 4th layer to prevent overfitting
model6.add(Dropout(0.3))
# Sigmoid output layer for binary classification
model6.add(Dense(1, activation='sigmoid'))
model6.compile(optimizer=optimizer,
loss='binary_crossentropy',
metrics=['Recall'])
# Generate the model reports
model6_training_metrics_df, model6_validation_metrics_df, model6_training_report_df, model6_validation_report_df = generate_model_reports(
model=model6,
model_name="Model 6"
)
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| dense (Dense) | (None, 40) | 1,640 |
| dense_1 (Dense) | (None, 20) | 820 |
| dropout (Dropout) | (None, 20) | 0 |
| dense_2 (Dense) | (None, 10) | 210 |
| batch_normalization (BatchNormalization) | (None, 10) | 40 |
| dense_3 (Dense) | (None, 5) | 55 |
| dropout_1 (Dropout) | (None, 5) | 0 |
| dense_4 (Dense) | (None, 1) | 6 |
Total params: 8,275 (32.33 KB)
Trainable params: 2,751 (10.75 KB)
Non-trainable params: 20 (80.00 B)
Optimizer params: 5,504 (21.50 KB)
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 6 Training | 0.079933 | 0.919374 | 0.991067 | 0.917167 | 0.921592 |
| | Model | Loss | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|---|
| 0 | Model 6 Validation | 0.073594 | 0.904505 | 0.9894 | 0.906137 | 0.902878 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.995131 | 0.995412 | 0.995271 | 14167.000000 |
| 1.0 | 0.921592 | 0.917167 | 0.919374 | 833.000000 |
| accuracy | 0.991067 | 0.991067 | 0.991067 | 0.991067 |
| macro avg | 0.958362 | 0.956289 | 0.957323 | 15000.000000 |
| weighted avg | 0.991047 | 0.991067 | 0.991057 | 15000.000000 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.994494 | 0.994283 | 0.994389 | 4723.000000 |
| 1.0 | 0.902878 | 0.906137 | 0.904505 | 277.000000 |
| accuracy | 0.989400 | 0.989400 | 0.989400 | 0.989400 |
| macro avg | 0.948686 | 0.950210 | 0.949447 | 5000.000000 |
| weighted avg | 0.989418 | 0.989400 | 0.989409 | 5000.000000 |
- The model achieves a recall score of .914 and .909 on the training and validation sets respectively. The training set and validation both perform about the same as the previous model.
- Adding the extra dropout layers, hidden layers, and normalization layers makes the model more complex without adding any value.
Model Performance Comparison and Final Model Selection¶
Now, in order to select the final model, we will compare the performances of all the models for the training and validation sets.
# Collect all of the training model metrics DataFrames into a single DataFrame
model_training_metrics_list = pd.concat([
    model0_training_metrics_df,
    model1_training_metrics_df,
    model2_training_metrics_df,
    model3_training_metrics_df,
    model4_training_metrics_df,
    model5_training_metrics_df,
    model6_training_metrics_df
], ignore_index=True)
# Collect all of the validation model metrics DataFrames into a single DataFrame
model_validation_metrics_list = pd.concat([
    model0_validation_metrics_df,
    model1_validation_metrics_df,
    model2_validation_metrics_df,
    model3_validation_metrics_df,
    model4_validation_metrics_df,
    model5_validation_metrics_df,
    model6_validation_metrics_df
], ignore_index=True)
# Calculate the difference in performance metrics between the training and validation metrics DataFrames
model_metrics_difference_df = performance_metrics_difference(
    model_training_metrics_list,
    model_validation_metrics_list
)
# Sort by the recall score difference so the model with the smallest train/validation gap comes first
model_metrics_difference_df = model_metrics_difference_df.sort_values(by='Recall Score Difference')
display(model_training_metrics_list.T)
display(model_validation_metrics_list.T)
display(model_metrics_difference_df.T)
# Display the model with the smallest train/validation recall gap
display(model_metrics_difference_df.iloc[0].T)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Model | Model 0 Training | Model 1 Training | Model 2 Training | Model 3 Training | Model 4 Training | Model 5 Training | Model 6 Training |
| Loss | 0.0908 | 0.038931 | 0.069391 | 0.048698 | 0.050145 | 0.068797 | 0.079933 |
| F1 Score | 0.846585 | 0.936937 | 0.882751 | 0.934211 | 0.925391 | 0.91432 | 0.919374 |
| Accuracy Score | 0.981733 | 0.993 | 0.986133 | 0.992667 | 0.991733 | 0.990467 | 0.991067 |
| Recall Score | 0.907563 | 0.936375 | 0.939976 | 0.937575 | 0.923169 | 0.915966 | 0.917167 |
| Precision Score | 0.793284 | 0.9375 | 0.832094 | 0.93087 | 0.927624 | 0.912679 | 0.921592 |
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Model | Model 0 Validation | Model 1 Validation | Model 2 Validation | Model 3 Validation | Model 4 Validation | Model 5 Validation | Model 6 Validation |
| Loss | 0.111384 | 0.119307 | 0.090182 | 0.107645 | 0.075911 | 0.098513 | 0.073594 |
| F1 Score | 0.787402 | 0.797428 | 0.847059 | 0.856164 | 0.870466 | 0.918033 | 0.904505 |
| Accuracy Score | 0.973 | 0.9748 | 0.9818 | 0.9832 | 0.985 | 0.991 | 0.9894 |
| Recall Score | 0.902527 | 0.895307 | 0.909747 | 0.902527 | 0.909747 | 0.909747 | 0.906137 |
| Precision Score | 0.698324 | 0.718841 | 0.792453 | 0.814332 | 0.834437 | 0.926471 | 0.902878 |
| | 0 | 5 | 6 | 4 | 2 | 3 | 1 |
|---|---|---|---|---|---|---|---|
| Model | Model 0 | Model 5 | Model 6 | Model 4 | Model 2 | Model 3 | Model 1 |
| Loss Difference | 0.020584 | 0.029716 | 0.006339 | 0.025766 | 0.02079 | 0.058947 | 0.080376 |
| F1 Score Difference | 0.059183 | 0.003713 | 0.01487 | 0.054925 | 0.035692 | 0.078046 | 0.139509 |
| Accuracy Score Difference | 0.008733 | 0.000533 | 0.001667 | 0.006733 | 0.004333 | 0.009467 | 0.0182 |
| Recall Score Difference | 0.005036 | 0.006219 | 0.01103 | 0.013422 | 0.030229 | 0.035048 | 0.041068 |
| Precision Score Difference | 0.09496 | 0.013791 | 0.018715 | 0.093187 | 0.039641 | 0.116538 | 0.218659 |
Model                         Model 0
Loss Difference               0.020584
F1 Score Difference           0.059183
Accuracy Score Difference     0.008733
Recall Score Difference       0.005036
Precision Score Difference    0.09496
Name: 0, dtype: object
- After calculating the train/validation differences, Model 0 shows the smallest gap in recall score (0.005) between the training and validation sets, so this is the model we will evaluate against the test set.
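The `performance_metrics_difference` helper is defined earlier in the notebook; for reference, a minimal sketch consistent with the columns shown above (an assumption about the actual implementation, not a copy of it) could look like:

```python
import pandas as pd

def performance_metrics_difference(train_df, val_df):
    """Absolute train-vs-validation gap for each metric column.

    Hypothetical sketch: assumes both DataFrames share the columns
    Model / Loss / F1 Score / Accuracy Score / Recall Score / Precision Score,
    aligned row-by-row per model.
    """
    metric_cols = ["Loss", "F1 Score", "Accuracy Score", "Recall Score", "Precision Score"]
    diff = pd.DataFrame()
    # Keep the model name without the "Training"/"Validation" suffix
    diff["Model"] = train_df["Model"].str.replace(" Training", "", regex=False)
    for col in metric_cols:
        diff[f"{col} Difference"] = (train_df[col] - val_df[col]).abs()
    return diff
```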
Now, let's check the performance of the final model on the test set.
# Calculate the performance metrics of the best model on the test set
best_model_test_perf = performance_metrics(model0, X_test, y_test, "Model 0 Test")
y_test_pred_best = model0.predict(X_test)
# For readability, build a DataFrame from the classification report dictionary
report = classification_report(y_test, y_test_pred_best > THRESHOLD, output_dict=True)
report_df = pd.DataFrame(report).transpose()
# Add a caption/title using the pandas Styler
styled_report_df = report_df.style.set_caption("Model 0 Test Set Classification Report")
display(best_model_test_perf)
display(styled_report_df)
draw_confusion_matrix(model0, X_test, y_test, "Model 0 Test Set Confusion Matrix")
| | Model | F1 Score | Accuracy Score | Recall Score | Precision Score |
|---|---|---|---|---|---|
| 0 | Model 0 Test | 0.767802 | 0.97 | 0.879433 | 0.681319 |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.992666 | 0.975413 | 0.983964 | 4718.000000 |
| 1.0 | 0.681319 | 0.879433 | 0.767802 | 282.000000 |
| accuracy | 0.970000 | 0.970000 | 0.970000 | 0.970000 |
| macro avg | 0.836992 | 0.927423 | 0.875883 | 5000.000000 |
| weighted avg | 0.975106 | 0.970000 | 0.971773 | 5000.000000 |
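Because a false positive only incurs an inspection fee while a missed failure incurs a replacement, the decision threshold applied to the model's sigmoid output is itself a cost lever: lowering it trades precision for recall. A hedged sketch of that trade-off on synthetic scores (not the actual model outputs; `sweep_thresholds` is a hypothetical helper written for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def sweep_thresholds(y_true, scores, thresholds):
    """Precision and recall at each candidate decision threshold."""
    results = {}
    for t in thresholds:
        y_hat = (scores > t).astype(int)  # binarize the predicted probabilities
        results[t] = (precision_score(y_true, y_hat), recall_score(y_true, y_hat))
    return results

# Synthetic example: two true failures among six turbines
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.40, 0.60, 0.55, 0.90])
for t, (p, r) in sweep_thresholds(y_true, scores, [0.3, 0.5, 0.7]).items():
    print(f"threshold {t}: precision={p:.2f} recall={r:.2f}")
```

As the threshold rises, precision improves while recall falls; the right operating point depends on the ratio of inspection cost to replacement cost.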
Actionable Insights and Recommendations¶
- The model accurately detects approximately 88% of the wind turbines that are about to fail: based on the confusion matrix, it correctly predicts 248 of the 282 actual failures.
- As stated previously, recall was the primary metric throughout the modeling, since the biggest financial risk to the company is the cost of replacing a failed turbine; catching failures early is the safer bet.
- In the worst case, if the model classifies a wind turbine as about to fail when it is not (a false positive), the cost to the company is minimal: an inspection fee rather than a replacement fee.
- Since the input parameters are ciphered for confidentiality, the most the bivariate analysis could show was which input parameters were highly correlated with each other; none of the inputs showed a strong correlation with the target. The business will know best what those exact parameters are.
- While the neural network does a great job of capturing complex interactions among the 40 input parameters, that same flexibility means it cannot report feature (input) importance the way some other machine-learning algorithms can.
- One of the major challenges with the dataset is that it is heavily skewed toward non-failures, which likely kept true-failure detection in the high-80% range. If possible, it would help to gather more failure scenarios and, in general, a more balanced dataset for the model to learn from. In the future, we could also try techniques such as SMOTE to synthetically create failure scenarios to fill out the dataset.
- Lastly, based on this first-generation model, it would be helpful for the business to define an ideal target prediction performance for the model to achieve.
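On the imbalance point above, SMOTE's core idea is to synthesize new minority-class samples by interpolating between a real minority sample and one of its nearest minority neighbors. A minimal NumPy sketch of that idea (`smote_like` is a hypothetical helper written for illustration; a real pipeline would typically use imbalanced-learn's `SMOTE`):

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating toward nearest neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise distances among minority samples; exclude self-matches
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbors per sample
    # Pick a random base sample and a random one of its neighbors for each new point
    base = rng.integers(0, n, size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    # Interpolate a random fraction of the way from the base toward the neighbor
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```

Because each synthetic point lies on a segment between two real failure samples, it stays inside the region the minority class already occupies rather than being pure noise.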
Export the project to HTML¶
# Export the project to HTML
!jupyter nbconvert --to html "ErnestHolloway-INN_ReneWind_Main_Project_FullCode_Notebook-8-4-25.ipynb"
[NbConvertApp] Converting notebook ErnestHolloway-INN_ReneWind_Main_Project_FullCode_Notebook-8-4-25.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 50 image(s).
[NbConvertApp] Writing 4976155 bytes to ErnestHolloway-INN_ReneWind_Main_Project_FullCode_Notebook-8-4-25.html